Message ID | 20231212162935.42910-1-philmd@linaro.org |
---|---|
Series | hw/cpu/arm: Remove one use of qemu_get_cpu() in A7/A15 MPCore priv |
Hi, ping for review?

On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
> Hi,
>
> When a MPCore cluster is used, the Cortex-A cores belong the the
> cluster container, not to the board/soc layer. This series move
> the creation of vCPUs to the MPCore private container.
>
> Doing so we consolidate the QOM model, moving common code in a
> central place (abstract MPCore parent).
>
> This eventually allow removing one qemu_get_cpu() use, which we
> want to remove in heterogeneous machines (machines using MPCore
> are candidate for heterogeneous emulation).
>
> Maybe these hw/cpu/arm/ files belong to hw/arm/...
>
> Regards,
>
> Phil.
>
> Philippe Mathieu-Daudé (33):
>   hw/arm/boot: Propagate vCPU to arm_load_dtb()
>   hw/arm/fsl-imx6: Add a local 'gic' variable
>   hw/arm/fsl-imx6ul: Add a local 'gic' variable
>   hw/arm/fsl-imx7: Add a local 'gic' variable
>   hw/cpu: Remove dead Kconfig
>   hw/cpu/arm: Rename 'busdev' -> 'gicsbd' in a15mp_priv_realize()
>   hw/cpu/arm: Alias 'num-cpu' property on TYPE_REALVIEW_MPCORE
>   hw/cpu/arm: Declare CPU QOM types using DEFINE_TYPES() macro
>   hw/cpu/arm: Merge {a9mpcore.h, a15mpcore.h} as cortex_mpcore.h
>   hw/cpu/arm: Introduce abstract CORTEX_MPCORE_PRIV QOM type
>   hw/cpu/arm: Have A9MPCORE/A15MPCORE inheritate common
>     CORTEX_MPCORE_PRIV
>   hw/cpu/arm: Create MPCore container in QOM parent
>   hw/cpu/arm: Handle 'num_cores' property once in MPCore parent
>   hw/cpu/arm: Handle 'has_el2/3' properties once in MPCore parent
>   hw/cpu/arm: Handle 'gic-irq' property once in MPCore parent
>   hw/cpu/arm: Handle GIC once in MPCore parent
>   hw/cpu/arm: Document more properties of CORTEX_MPCORE_PRIV QOM type
>   hw/cpu/arm: Replace A15MPPrivState by CortexMPPrivState
>   hw/cpu/arm: Introduce TYPE_A7MPCORE_PRIV for Cortex-A7 MPCore
>   hw/cpu/arm: Consolidate check on max GIC spi supported
>   hw/cpu/arm: Create CPUs once in MPCore parent
>   hw/arm/aspeed_ast2600: Let the A7MPcore create/wire the CPU cores
>   hw/arm/exynos4210: Let the A9MPcore create/wire the CPU cores
>   hw/arm/fsl-imx6: Let the A9MPcore create/wire the CPU cores
>   hw/arm/fsl-imx6ul: Let the A7MPcore create/wire the CPU cores
>   hw/arm/fsl-imx7: Let the A7MPcore create/wire the CPU cores
>   hw/arm/highbank: Let the A9/A15MPcore create/wire the CPU cores
>   hw/arm/vexpress: Let the A9/A15MPcore create/wire the CPU cores
>   hw/arm/xilinx_zynq: Let the A9MPcore create/wire the CPU cores
>   hw/arm/npcm7xx: Let the A9MPcore create/wire the CPU cores
>   hw/cpu/a9mpcore: Remove legacy code
>   hw/cpu/arm: Remove 'num-cpu' property alias
>   hw/cpu/arm: Remove use of qemu_get_cpu() in A7/A15 realize()
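For readers less familiar with the pattern the series moves toward, here is a rough, hypothetical sketch of an MPCore container creating its own vCPUs as QOM children instead of the board/SoC doing it. This is not code from the series; the DemoMPCorePrivState type, its fields, and the realize function name are invented for illustration. Only the QOM/qdev calls (object_initialize_child(), qdev_realize()) are real QEMU API.

    /*
     * Hypothetical sketch: the MPCore private container owns the CPU
     * cores, so they appear as /cpu[0], /cpu[1], ... under the container
     * in the QOM tree rather than under the SoC. QOM type registration
     * and cast macros are omitted for brevity.
     */
    #include "qemu/osdep.h"
    #include "hw/sysbus.h"
    #include "cpu.h"                           /* ARMCPU */

    #define DEMO_MPCORE_MAX_CORES 4

    typedef struct DemoMPCorePrivState {       /* illustrative, not the series' type */
        SysBusDevice parent_obj;
        unsigned num_cores;
        char *cpu_type;
        ARMCPU cpu[DEMO_MPCORE_MAX_CORES];
    } DemoMPCorePrivState;

    static void demo_mpcore_priv_realize(DeviceState *dev, Error **errp)
    {
        DemoMPCorePrivState *s = (DemoMPCorePrivState *)dev;

        for (unsigned i = 0; i < s->num_cores; i++) {
            /* "cpu[*]" auto-numbers the child properties: cpu[0], cpu[1], ... */
            object_initialize_child(OBJECT(s), "cpu[*]", &s->cpu[i],
                                    s->cpu_type);
            if (!qdev_realize(DEVICE(&s->cpu[i]), NULL, errp)) {
                return;
            }
        }
        /*
         * GIC creation and CPU/GIC wiring would also live here, which is
         * what eventually lets boards stop calling qemu_get_cpu().
         */
    }

A board then only instantiates the MPCore device and sets its 'num_cores' and CPU-type properties, which is the direction the "Let the ...MPcore create/wire the CPU cores" patches point to.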
On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
> Hi,
>
> When a MPCore cluster is used, the Cortex-A cores belong the the
> cluster container, not to the board/soc layer. This series move
> the creation of vCPUs to the MPCore private container.
>
> Doing so we consolidate the QOM model, moving common code in a
> central place (abstract MPCore parent).

Changing the QOM hierarchy has an impact on the state of the machine
and some fixups are then required to maintain migration compatibility.
This can become a real headache for KVM machines like virt for which
migration compatibility is a feature, less for emulated ones.

I don't have a good solution to propose to overcome this problem :/

C.

>
> This eventually allow removing one qemu_get_cpu() use, which we
> want to remove in heterogeneous machines (machines using MPCore
> are candidate for heterogeneous emulation).
>
> Maybe these hw/cpu/arm/ files belong to hw/arm/...
>
> Regards,
>
> Phil.
>
> Philippe Mathieu-Daudé (33):
> [...]
>
>  MAINTAINERS | 3 +-
>  include/hw/arm/aspeed_soc.h | 5 +-
>  include/hw/arm/boot.h | 4 +-
>  include/hw/arm/exynos4210.h | 6 +-
>  include/hw/arm/fsl-imx6.h | 6 +-
>  include/hw/arm/fsl-imx6ul.h | 8 +-
>  include/hw/arm/fsl-imx7.h | 8 +-
>  include/hw/arm/npcm7xx.h | 3 +-
>  include/hw/cpu/a15mpcore.h | 44 -------
>  include/hw/cpu/a9mpcore.h | 39 -------
>  include/hw/cpu/cortex_mpcore.h | 135 ++++++++++++++++++++++
>  hw/arm/aspeed_ast2600.c | 61 ++++------
>  hw/arm/boot.c | 11 +-
>  hw/arm/exynos4210.c | 60 ++++------
>  hw/arm/exynos4_boards.c | 6 +-
>  hw/arm/fsl-imx6.c | 84 ++++----------
>  hw/arm/fsl-imx6ul.c | 65 ++++-------
>  hw/arm/fsl-imx7.c | 103 +++++------------
>  hw/arm/highbank.c | 56 ++-------
>  hw/arm/mcimx6ul-evk.c | 3 +-
>  hw/arm/mcimx7d-sabre.c | 3 +-
>  hw/arm/npcm7xx.c | 48 ++------
>  hw/arm/realview.c | 4 +-
>  hw/arm/sabrelite.c | 4 +-
>  hw/arm/vexpress.c | 60 +++-------
>  hw/arm/virt.c | 2 +-
>  hw/arm/xilinx_zynq.c | 30 ++---
>  hw/cpu/a15mpcore.c | 179 +++++++++++++----------------
>  hw/cpu/a9mpcore.c | 138 +++++++++-------------
>  hw/cpu/arm11mpcore.c | 23 ++--
>  hw/cpu/cortex_mpcore.c | 202 +++++++++++++++++++++++++++++++++
>  hw/cpu/realview_mpcore.c | 30 ++---
>  hw/arm/Kconfig | 8 +-
>  hw/cpu/Kconfig | 8 --
>  hw/cpu/meson.build | 1 +
>  35 files changed, 689 insertions(+), 761 deletions(-)
>  delete mode 100644 include/hw/cpu/a15mpcore.h
>  delete mode 100644 include/hw/cpu/a9mpcore.h
>  create mode 100644 include/hw/cpu/cortex_mpcore.h
>  create mode 100644 hw/cpu/cortex_mpcore.c
>  delete mode 100644 hw/cpu/Kconfig
Hi Cédric, On 2/1/24 15:55, Cédric Le Goater wrote: > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >> Hi, >> >> When a MPCore cluster is used, the Cortex-A cores belong the the >> cluster container, not to the board/soc layer. This series move >> the creation of vCPUs to the MPCore private container. >> >> Doing so we consolidate the QOM model, moving common code in a >> central place (abstract MPCore parent). > > Changing the QOM hierarchy has an impact on the state of the machine > and some fixups are then required to maintain migration compatibility. > This can become a real headache for KVM machines like virt for which > migration compatibility is a feature, less for emulated ones. All changes are either moving properties (which are not migrated) or moving non-migrated QOM members (i.e. pointers of ARMCPU, which is still migrated elsewhere). So I don't see any obvious migration problem, but I might be missing something, so I Cc'ed Juan :> > > I don't have a good solution to propose to overcome this problem :/ > > C. > > >> >> This eventually allow removing one qemu_get_cpu() use, which we >> want to remove in heterogeneous machines (machines using MPCore >> are candidate for heterogeneous emulation). >> include/hw/arm/aspeed_soc.h | 5 +- >> include/hw/arm/boot.h | 4 +- >> include/hw/arm/exynos4210.h | 6 +- >> include/hw/arm/fsl-imx6.h | 6 +- >> include/hw/arm/fsl-imx6ul.h | 8 +- >> include/hw/arm/fsl-imx7.h | 8 +- >> include/hw/arm/npcm7xx.h | 3 +- >> include/hw/cpu/a15mpcore.h | 44 ------- >> include/hw/cpu/a9mpcore.h | 39 ------- >> include/hw/cpu/cortex_mpcore.h | 135 ++++++++++++++++++++++
On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:
> Hi Cédric,
>
> On 2/1/24 15:55, Cédric Le Goater wrote:
>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:
>>> [...]
>>
>> Changing the QOM hierarchy has an impact on the state of the machine
>> and some fixups are then required to maintain migration compatibility.
>> This can become a real headache for KVM machines like virt for which
>> migration compatibility is a feature, less for emulated ones.
>
> All changes are either moving properties (which are not migrated)
> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
> is still migrated elsewhere). So I don't see any obvious migration
> problem, but I might be missing something, so I Cc'ed Juan :>

We broke migration compatibility by moving the IC object in the QOM
hierarchy of the pseries machines in the past. These changes might
be different. Here is the QOM tree of the ast2600 SoC.

before :

  /soc (ast2600-a3)
    /a7mpcore (a15mpcore_priv)
      /a15mp-priv-container[0] (memory-region)
      /gic (arm_gic)
        /gic_cpu[0] (memory-region)
        /gic_cpu[1] (memory-region)
        /gic_cpu[2] (memory-region)
        /gic_dist[0] (memory-region)
        /gic_vcpu[0] (memory-region)
        /gic_viface[0] (memory-region)
        /gic_viface[1] (memory-region)
        /gic_viface[2] (memory-region)
    /cpu[0] (cortex-a7-arm-cpu)
    /cpu[1] (cortex-a7-arm-cpu)

after :

  /soc (ast2600-a3)
    /a7mpcore (a7mpcore_priv)
      /cpu[0] (cortex-a7-arm-cpu)
      /cpu[1] (cortex-a7-arm-cpu)
      /gic (arm_gic)
        /gic_cpu[0] (memory-region)
        /gic_cpu[1] (memory-region)
        /gic_cpu[2] (memory-region)
        /gic_dist[0] (memory-region)
      /mpcore-priv-container[0] (memory-region)

Thanks,

C.
+Peter/Fabiano On 2/1/24 17:41, Cédric Le Goater wrote: > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >> Hi Cédric, >> >> On 2/1/24 15:55, Cédric Le Goater wrote: >>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>> Hi, >>>> >>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>> cluster container, not to the board/soc layer. This series move >>>> the creation of vCPUs to the MPCore private container. >>>> >>>> Doing so we consolidate the QOM model, moving common code in a >>>> central place (abstract MPCore parent). >>> >>> Changing the QOM hierarchy has an impact on the state of the machine >>> and some fixups are then required to maintain migration compatibility. >>> This can become a real headache for KVM machines like virt for which >>> migration compatibility is a feature, less for emulated ones. >> >> All changes are either moving properties (which are not migrated) >> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >> is still migrated elsewhere). So I don't see any obvious migration >> problem, but I might be missing something, so I Cc'ed Juan :> > We broke migration compatibility by moving the IC object in the QOM > hierarchy of the pseries machines in the past. These changes might > be different. Here is the QOM tree of the ast2600 SoC. > > before : > > /soc (ast2600-a3) > /a7mpcore (a15mpcore_priv) > /a15mp-priv-container[0] (memory-region) > /gic (arm_gic) > /gic_cpu[0] (memory-region) > /gic_cpu[1] (memory-region) > /gic_cpu[2] (memory-region) > /gic_dist[0] (memory-region) > /gic_vcpu[0] (memory-region) > /gic_viface[0] (memory-region) > /gic_viface[1] (memory-region) > /gic_viface[2] (memory-region) > /cpu[0] (cortex-a7-arm-cpu) > /cpu[1] (cortex-a7-arm-cpu) > > after : > > /soc (ast2600-a3) > /a7mpcore (a7mpcore_priv) > /cpu[0] (cortex-a7-arm-cpu) > /cpu[1] (cortex-a7-arm-cpu) > /gic (arm_gic) > /gic_cpu[0] (memory-region) > /gic_cpu[1] (memory-region) > /gic_cpu[2] (memory-region) > /gic_dist[0] (memory-region) > /mpcore-priv-container[0] (memory-region) > > > Thanks, > > C. > > >
Philippe Mathieu-Daudé <philmd@linaro.org> writes: > +Peter/Fabiano > > On 2/1/24 17:41, Cédric Le Goater wrote: >> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>> Hi Cédric, >>> >>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>> Hi, >>>>> >>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>> cluster container, not to the board/soc layer. This series move >>>>> the creation of vCPUs to the MPCore private container. >>>>> >>>>> Doing so we consolidate the QOM model, moving common code in a >>>>> central place (abstract MPCore parent). >>>> >>>> Changing the QOM hierarchy has an impact on the state of the machine >>>> and some fixups are then required to maintain migration compatibility. >>>> This can become a real headache for KVM machines like virt for which >>>> migration compatibility is a feature, less for emulated ones. >>> >>> All changes are either moving properties (which are not migrated) >>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>> is still migrated elsewhere). So I don't see any obvious migration >>> problem, but I might be missing something, so I Cc'ed Juan :> FWIW, I didn't spot anything problematic either. I've ran this through my migration compatibility series [1] and it doesn't regress aarch64 migration from/to 8.2. The tests use '-M virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't think we even support migration of anything non-KVM on arm. 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >> We broke migration compatibility by moving the IC object in the QOM >> hierarchy of the pseries machines in the past. These changes might >> be different. Here is the QOM tree of the ast2600 SoC. >> >> before : >> >> /soc (ast2600-a3) >> /a7mpcore (a15mpcore_priv) >> /a15mp-priv-container[0] (memory-region) >> /gic (arm_gic) >> /gic_cpu[0] (memory-region) >> /gic_cpu[1] (memory-region) >> /gic_cpu[2] (memory-region) >> /gic_dist[0] (memory-region) >> /gic_vcpu[0] (memory-region) >> /gic_viface[0] (memory-region) >> /gic_viface[1] (memory-region) >> /gic_viface[2] (memory-region) >> /cpu[0] (cortex-a7-arm-cpu) >> /cpu[1] (cortex-a7-arm-cpu) >> >> after : >> >> /soc (ast2600-a3) >> /a7mpcore (a7mpcore_priv) >> /cpu[0] (cortex-a7-arm-cpu) >> /cpu[1] (cortex-a7-arm-cpu) >> /gic (arm_gic) >> /gic_cpu[0] (memory-region) >> /gic_cpu[1] (memory-region) >> /gic_cpu[2] (memory-region) >> /gic_dist[0] (memory-region) >> /mpcore-priv-container[0] (memory-region) >> >> >> Thanks, >> >> C. >> >> >>
On 1/3/24 20:53, Fabiano Rosas wrote: > Philippe Mathieu-Daudé <philmd@linaro.org> writes: > >> +Peter/Fabiano >> >> On 2/1/24 17:41, Cédric Le Goater wrote: >>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>> Hi Cédric, >>>> >>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>> Hi, >>>>>> >>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>> cluster container, not to the board/soc layer. This series move >>>>>> the creation of vCPUs to the MPCore private container. >>>>>> >>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>> central place (abstract MPCore parent). >>>>> >>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>> and some fixups are then required to maintain migration compatibility. >>>>> This can become a real headache for KVM machines like virt for which >>>>> migration compatibility is a feature, less for emulated ones. >>>> >>>> All changes are either moving properties (which are not migrated) >>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>> is still migrated elsewhere). So I don't see any obvious migration >>>> problem, but I might be missing something, so I Cc'ed Juan :> > > FWIW, I didn't spot anything problematic either. > > I've ran this through my migration compatibility series [1] and it > doesn't regress aarch64 migration from/to 8.2. The tests use '-M > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't > think we even support migration of anything non-KVM on arm. it happens we do. > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 yes it depends on the QOM hierarchy and virt seems immune to the changes. Good. However, changing the QOM topology clearly breaks migration compat, at least on the Aspeed SoC. The question is : do we care ? For Aspeed probably not, but the series should say so. Thanks, C. > >>> We broke migration compatibility by moving the IC object in the QOM >>> hierarchy of the pseries machines in the past. These changes might >>> be different. Here is the QOM tree of the ast2600 SoC. >>> >>> before : >>> >>> /soc (ast2600-a3) >>> /a7mpcore (a15mpcore_priv) >>> /a15mp-priv-container[0] (memory-region) >>> /gic (arm_gic) >>> /gic_cpu[0] (memory-region) >>> /gic_cpu[1] (memory-region) >>> /gic_cpu[2] (memory-region) >>> /gic_dist[0] (memory-region) >>> /gic_vcpu[0] (memory-region) >>> /gic_viface[0] (memory-region) >>> /gic_viface[1] (memory-region) >>> /gic_viface[2] (memory-region) >>> /cpu[0] (cortex-a7-arm-cpu) >>> /cpu[1] (cortex-a7-arm-cpu) >>> >>> after : >>> >>> /soc (ast2600-a3) >>> /a7mpcore (a7mpcore_priv) >>> /cpu[0] (cortex-a7-arm-cpu) >>> /cpu[1] (cortex-a7-arm-cpu) >>> /gic (arm_gic) >>> /gic_cpu[0] (memory-region) >>> /gic_cpu[1] (memory-region) >>> /gic_cpu[2] (memory-region) >>> /gic_dist[0] (memory-region) >>> /mpcore-priv-container[0] (memory-region) >>> >>> >>> Thanks, >>> >>> C. >>> >>> >>>
Cédric Le Goater <clg@kaod.org> writes: > On 1/3/24 20:53, Fabiano Rosas wrote: >> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >> >>> +Peter/Fabiano >>> >>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>> Hi Cédric, >>>>> >>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>> Hi, >>>>>>> >>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>> >>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>> central place (abstract MPCore parent). >>>>>> >>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>> and some fixups are then required to maintain migration compatibility. >>>>>> This can become a real headache for KVM machines like virt for which >>>>>> migration compatibility is a feature, less for emulated ones. >>>>> >>>>> All changes are either moving properties (which are not migrated) >>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>> problem, but I might be missing something, so I Cc'ed Juan :> >> >> FWIW, I didn't spot anything problematic either. >> >> I've ran this through my migration compatibility series [1] and it >> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >> think we even support migration of anything non-KVM on arm. > > it happens we do. > Oh, sorry, I didn't mean TCG here. Probably meant to say something like non-KVM-capable cpus, as in 32-bit. Nevermind. >> >> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 > > yes it depends on the QOM hierarchy and virt seems immune to the changes. > Good. > > However, changing the QOM topology clearly breaks migration compat, Well, "clearly" is relative =) You've mentioned pseries and aspeed already, do you have a pointer to one of those cases were we broke migration so I could take a look?
On 1/9/24 18:40, Fabiano Rosas wrote: > Cédric Le Goater <clg@kaod.org> writes: > >> On 1/3/24 20:53, Fabiano Rosas wrote: >>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>> >>>> +Peter/Fabiano >>>> >>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>> Hi Cédric, >>>>>> >>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>> >>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>> central place (abstract MPCore parent). >>>>>>> >>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>> >>>>>> All changes are either moving properties (which are not migrated) >>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>> >>> FWIW, I didn't spot anything problematic either. >>> >>> I've ran this through my migration compatibility series [1] and it >>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>> think we even support migration of anything non-KVM on arm. >> >> it happens we do. >> > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like > non-KVM-capable cpus, as in 32-bit. Nevermind. Theoretically, we should be able to migrate to a TCG guest. Well, this worked in the past for PPC. When I was doing more KVM related changes, this was very useful for dev. Also, some machines are partially emulated. Anyhow I agree this is not a strong requirement and we often break it. Let's focus on KVM only. >>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >> >> yes it depends on the QOM hierarchy and virt seems immune to the changes. >> Good. >> >> However, changing the QOM topology clearly breaks migration compat, > > Well, "clearly" is relative =) You've mentioned pseries and aspeed > already, do you have a pointer to one of those cases were we broke > migration Regarding pseries, migration compat broke because of 5bc8d26de20c ("spapr: allocate the ICPState object from under sPAPRCPUCore") which is similar to the changes proposed by this series, it impacts the QOM hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 ("spapr: fix migration of ICPState objects from/to older QEMU") which is quite an headache and this turned out to raise another problem some months ago ... :/ That's why I sent [1] to prepare removal of old machines and workarounds becoming a burden. Regarding aspeed, this series breaks compat. Not that we care much but this caught my attention because of my past experience on pseries. Same kind of QOM change which could impact other machines, like virt. Since you checked that migration compat is preserved on virt, we should be fine. Thanks, C. [1] https://lore.kernel.org/qemu-devel/20231214181723.1520854-1-clg@kaod.org/
Cédric Le Goater <clg@kaod.org> writes: > On 1/9/24 18:40, Fabiano Rosas wrote: >> Cédric Le Goater <clg@kaod.org> writes: >> >>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>> >>>>> +Peter/Fabiano >>>>> >>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>> Hi Cédric, >>>>>>> >>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>> >>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>> central place (abstract MPCore parent). >>>>>>>> >>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>> >>>>>>> All changes are either moving properties (which are not migrated) >>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>> >>>> FWIW, I didn't spot anything problematic either. >>>> >>>> I've ran this through my migration compatibility series [1] and it >>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>> think we even support migration of anything non-KVM on arm. >>> >>> it happens we do. >>> >> >> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >> non-KVM-capable cpus, as in 32-bit. Nevermind. > > Theoretically, we should be able to migrate to a TCG guest. Well, this > worked in the past for PPC. When I was doing more KVM related changes, > this was very useful for dev. Also, some machines are partially emulated. > Anyhow I agree this is not a strong requirement and we often break it. > Let's focus on KVM only. > >>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>> >>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>> Good. >>> >>> However, changing the QOM topology clearly breaks migration compat, >> >> Well, "clearly" is relative =) You've mentioned pseries and aspeed >> already, do you have a pointer to one of those cases were we broke >> migration > > Regarding pseries, migration compat broke because of 5bc8d26de20c > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which > is similar to the changes proposed by this series, it impacts the QOM > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 > ("spapr: fix migration of ICPState objects from/to older QEMU") which > is quite an headache and this turned out to raise another problem some > months ago ... :/ That's why I sent [1] to prepare removal of old > machines and workarounds becoming a burden. This feels like something that could be handled by the vmstate code somehow. The state is there, just under a different path. No one wants to be policing QOM hierarchy changes in every single series that shows up on the list. Anyway, thanks for the pointers. 
I'll study that code a bit more, maybe I can come up with some way to handle these cases. Hopefully between the analyze-migration test and the compat tests we'll catch the next bug of this kind before it gets merged.
Hi Cédric, On 9/1/24 19:06, Cédric Le Goater wrote: > On 1/9/24 18:40, Fabiano Rosas wrote: >> Cédric Le Goater <clg@kaod.org> writes: >> >>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>> >>>>> +Peter/Fabiano >>>>> >>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>> Hi Cédric, >>>>>>> >>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>> >>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>> central place (abstract MPCore parent). >>>>>>>> >>>>>>>> Changing the QOM hierarchy has an impact on the state of the >>>>>>>> machine >>>>>>>> and some fixups are then required to maintain migration >>>>>>>> compatibility. >>>>>>>> This can become a real headache for KVM machines like virt for >>>>>>>> which >>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>> >>>>>>> All changes are either moving properties (which are not migrated) >>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>> >>>> FWIW, I didn't spot anything problematic either. >>>> >>>> I've ran this through my migration compatibility series [1] and it >>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I >>>> don't >>>> think we even support migration of anything non-KVM on arm. >>> >>> it happens we do. >>> >> >> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >> non-KVM-capable cpus, as in 32-bit. Nevermind. > > Theoretically, we should be able to migrate to a TCG guest. Well, this > worked in the past for PPC. When I was doing more KVM related changes, > this was very useful for dev. Also, some machines are partially emulated. > Anyhow I agree this is not a strong requirement and we often break it. > Let's focus on KVM only. No no, we want the same for TCG. >>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>> >>> yes it depends on the QOM hierarchy and virt seems immune to the >>> changes. >>> Good. >>> >>> However, changing the QOM topology clearly breaks migration compat, >> >> Well, "clearly" is relative =) You've mentioned pseries and aspeed >> already, do you have a pointer to one of those cases were we broke >> migration > > Regarding pseries, migration compat broke because of 5bc8d26de20c > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which > is similar to the changes proposed by this series, it impacts the QOM > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 > ("spapr: fix migration of ICPState objects from/to older QEMU") which > is quite an headache and this turned out to raise another problem some > months ago ... :/ That's why I sent [1] to prepare removal of old > machines and workarounds becoming a burden. > > Regarding aspeed, this series breaks compat. Can you write down the steps to reproduce please? I'll debug it. We need to understand this. > Not that we care much > but this caught my attention because of my past experience on pseries. 
> Same kind of QOM change which could impact other machines, like virt. > Since you checked that migration compat is preserved on virt, we should > be fine. > > Thanks, > > C. > > [1] > https://lore.kernel.org/qemu-devel/20231214181723.1520854-1-clg@kaod.org/ >
On 9/1/24 22:07, Philippe Mathieu-Daudé wrote: > Hi Cédric, > > On 9/1/24 19:06, Cédric Le Goater wrote: >> On 1/9/24 18:40, Fabiano Rosas wrote: >>> Cédric Le Goater <clg@kaod.org> writes: >>> >>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>> >>>>>> +Peter/Fabiano >>>>>> >>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>> Hi Cédric, >>>>>>>> >>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>> >>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>> >>>>>>>>> Changing the QOM hierarchy has an impact on the state of the >>>>>>>>> machine >>>>>>>>> and some fixups are then required to maintain migration >>>>>>>>> compatibility. >>>>>>>>> This can become a real headache for KVM machines like virt for >>>>>>>>> which >>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>> >>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>> >>>>> FWIW, I didn't spot anything problematic either. >>>>> >>>>> I've ran this through my migration compatibility series [1] and it >>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I >>>>> don't >>>>> think we even support migration of anything non-KVM on arm. >>>> >>>> it happens we do. >>>> >>> >>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>> non-KVM-capable cpus, as in 32-bit. Nevermind. >> >> Theoretically, we should be able to migrate to a TCG guest. Well, this >> worked in the past for PPC. When I was doing more KVM related changes, >> this was very useful for dev. Also, some machines are partially emulated. >> Anyhow I agree this is not a strong requirement and we often break it. >> Let's focus on KVM only. > > No no, we want the same for TCG. > >>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>> >>>> yes it depends on the QOM hierarchy and virt seems immune to the >>>> changes. >>>> Good. >>>> >>>> However, changing the QOM topology clearly breaks migration compat, >>> >>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>> already, do you have a pointer to one of those cases were we broke >>> migration >> >> Regarding pseries, migration compat broke because of 5bc8d26de20c >> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >> is similar to the changes proposed by this series, it impacts the QOM >> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >> ("spapr: fix migration of ICPState objects from/to older QEMU") which >> is quite an headache and this turned out to raise another problem some >> months ago ... :/ That's why I sent [1] to prepare removal of old >> machines and workarounds becoming a burden. >> >> Regarding aspeed, this series breaks compat. > > Can you write down the steps to reproduce please? 
I'll debug it. Also, have you figured (bisecting) which patch start to break? > We need to understand this. > >> Not that we care much >> but this caught my attention because of my past experience on pseries. >> Same kind of QOM change which could impact other machines, like virt. >> Since you checked that migration compat is preserved on virt, we should >> be fine. >> >> Thanks, >> >> C. >> >> [1] >> https://lore.kernel.org/qemu-devel/20231214181723.1520854-1-clg@kaod.org/ >> >
Hi Fabiano, On 9/1/24 21:21, Fabiano Rosas wrote: > Cédric Le Goater <clg@kaod.org> writes: > >> On 1/9/24 18:40, Fabiano Rosas wrote: >>> Cédric Le Goater <clg@kaod.org> writes: >>> >>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>> >>>>>> +Peter/Fabiano >>>>>> >>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>> Hi Cédric, >>>>>>>> >>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>> >>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>> >>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>> >>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>> >>>>> FWIW, I didn't spot anything problematic either. >>>>> >>>>> I've ran this through my migration compatibility series [1] and it >>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>> think we even support migration of anything non-KVM on arm. >>>> >>>> it happens we do. >>>> >>> >>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>> non-KVM-capable cpus, as in 32-bit. Nevermind. >> >> Theoretically, we should be able to migrate to a TCG guest. Well, this >> worked in the past for PPC. When I was doing more KVM related changes, >> this was very useful for dev. Also, some machines are partially emulated. >> Anyhow I agree this is not a strong requirement and we often break it. >> Let's focus on KVM only. >> >>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>> >>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>> Good. >>>> >>>> However, changing the QOM topology clearly breaks migration compat, >>> >>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>> already, do you have a pointer to one of those cases were we broke >>> migration >> >> Regarding pseries, migration compat broke because of 5bc8d26de20c >> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >> is similar to the changes proposed by this series, it impacts the QOM >> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >> ("spapr: fix migration of ICPState objects from/to older QEMU") which >> is quite an headache and this turned out to raise another problem some >> months ago ... :/ That's why I sent [1] to prepare removal of old >> machines and workarounds becoming a burden. > > This feels like something that could be handled by the vmstate code > somehow. The state is there, just under a different path. What, the QOM path is used in migration? ... 
See recent discussions on "QOM path stability": https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ > No one wants > to be policing QOM hierarchy changes in every single series that shows > up on the list. > > Anyway, thanks for the pointers. I'll study that code a bit more, maybe > I can come up with some way to handle these cases. > > Hopefully between the analyze-migration test and the compat tests we'll > catch the next bug of this kind before it gets merged. > >
On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: > Hi Fabiano, > > On 9/1/24 21:21, Fabiano Rosas wrote: > > Cédric Le Goater <clg@kaod.org> writes: > > > > > On 1/9/24 18:40, Fabiano Rosas wrote: > > > > Cédric Le Goater <clg@kaod.org> writes: > > > > > > > > > On 1/3/24 20:53, Fabiano Rosas wrote: > > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes: > > > > > > > > > > > > > +Peter/Fabiano > > > > > > > > > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote: > > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: > > > > > > > > > Hi Cédric, > > > > > > > > > > > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote: > > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the > > > > > > > > > > > cluster container, not to the board/soc layer. This series move > > > > > > > > > > > the creation of vCPUs to the MPCore private container. > > > > > > > > > > > > > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a > > > > > > > > > > > central place (abstract MPCore parent). > > > > > > > > > > > > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine > > > > > > > > > > and some fixups are then required to maintain migration compatibility. > > > > > > > > > > This can become a real headache for KVM machines like virt for which > > > > > > > > > > migration compatibility is a feature, less for emulated ones. > > > > > > > > > > > > > > > > > > All changes are either moving properties (which are not migrated) > > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which > > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration > > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :> > > > > > > > > > > > > FWIW, I didn't spot anything problematic either. > > > > > > > > > > > > I've ran this through my migration compatibility series [1] and it > > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M > > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't > > > > > > think we even support migration of anything non-KVM on arm. > > > > > > > > > > it happens we do. > > > > > > > > > > > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like > > > > non-KVM-capable cpus, as in 32-bit. Nevermind. > > > > > > Theoretically, we should be able to migrate to a TCG guest. Well, this > > > worked in the past for PPC. When I was doing more KVM related changes, > > > this was very useful for dev. Also, some machines are partially emulated. > > > Anyhow I agree this is not a strong requirement and we often break it. > > > Let's focus on KVM only. > > > > > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 > > > > > > > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes. > > > > > Good. > > > > > > > > > > However, changing the QOM topology clearly breaks migration compat, > > > > > > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed > > > > already, do you have a pointer to one of those cases were we broke > > > > migration > > > > > > Regarding pseries, migration compat broke because of 5bc8d26de20c > > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which > > > is similar to the changes proposed by this series, it impacts the QOM > > > hierarchy. 
Here is the workaround/fix from Greg : 46f7afa37096 > > > ("spapr: fix migration of ICPState objects from/to older QEMU") which > > > is quite an headache and this turned out to raise another problem some > > > months ago ... :/ That's why I sent [1] to prepare removal of old > > > machines and workarounds becoming a burden. > > > > This feels like something that could be handled by the vmstate code > > somehow. The state is there, just under a different path. > > What, the QOM path is used in migration? ... Hopefully not.. > > See recent discussions on "QOM path stability": > https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ > https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ > https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ If I read it right, the commit 46f7afa37096 example is pretty special that the QOM path more or less decided more than the hierachy itself but changes the existances of objects. > > > No one wants > > to be policing QOM hierarchy changes in every single series that shows > > up on the list. > > > > Anyway, thanks for the pointers. I'll study that code a bit more, maybe > > I can come up with some way to handle these cases. > > > > Hopefully between the analyze-migration test and the compat tests we'll > > catch the next bug of this kind before it gets merged. Things like that might be able to be detected via vmstate-static-checker.py. But I'm not 100% sure, also its coverage is limited. For example, I don't think it can detect changes to objects that will only be created dynamically, e.g., I think sometimes we create objects after some guest behaviors (consider guest enables the device, then QEMU emulation creates some objects on demand of device setup?), and since the static checker shouldn't ever start the VM and run any code, they won't trigger.
Peter Xu <peterx@redhat.com> writes: > On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >> Hi Fabiano, >> >> On 9/1/24 21:21, Fabiano Rosas wrote: >> > Cédric Le Goater <clg@kaod.org> writes: >> > >> > > On 1/9/24 18:40, Fabiano Rosas wrote: >> > > > Cédric Le Goater <clg@kaod.org> writes: >> > > > >> > > > > On 1/3/24 20:53, Fabiano Rosas wrote: >> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes: >> > > > > > >> > > > > > > +Peter/Fabiano >> > > > > > > >> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote: >> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >> > > > > > > > > Hi Cédric, >> > > > > > > > > >> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote: >> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >> > > > > > > > > > > Hi, >> > > > > > > > > > > >> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the >> > > > > > > > > > > cluster container, not to the board/soc layer. This series move >> > > > > > > > > > > the creation of vCPUs to the MPCore private container. >> > > > > > > > > > > >> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a >> > > > > > > > > > > central place (abstract MPCore parent). >> > > > > > > > > > >> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine >> > > > > > > > > > and some fixups are then required to maintain migration compatibility. >> > > > > > > > > > This can become a real headache for KVM machines like virt for which >> > > > > > > > > > migration compatibility is a feature, less for emulated ones. >> > > > > > > > > >> > > > > > > > > All changes are either moving properties (which are not migrated) >> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration >> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :> >> > > > > > >> > > > > > FWIW, I didn't spot anything problematic either. >> > > > > > >> > > > > > I've ran this through my migration compatibility series [1] and it >> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M >> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >> > > > > > think we even support migration of anything non-KVM on arm. >> > > > > >> > > > > it happens we do. >> > > > > >> > > > >> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like >> > > > non-KVM-capable cpus, as in 32-bit. Nevermind. >> > > >> > > Theoretically, we should be able to migrate to a TCG guest. Well, this >> > > worked in the past for PPC. When I was doing more KVM related changes, >> > > this was very useful for dev. Also, some machines are partially emulated. >> > > Anyhow I agree this is not a strong requirement and we often break it. >> > > Let's focus on KVM only. >> > > >> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >> > > > > >> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes. >> > > > > Good. 
>> > > > > >> > > > > However, changing the QOM topology clearly breaks migration compat, >> > > > >> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed >> > > > already, do you have a pointer to one of those cases were we broke >> > > > migration >> > > >> > > Regarding pseries, migration compat broke because of 5bc8d26de20c >> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >> > > is similar to the changes proposed by this series, it impacts the QOM >> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which >> > > is quite an headache and this turned out to raise another problem some >> > > months ago ... :/ That's why I sent [1] to prepare removal of old >> > > machines and workarounds becoming a burden. >> > >> > This feels like something that could be handled by the vmstate code >> > somehow. The state is there, just under a different path. >> >> What, the QOM path is used in migration? ... > > Hopefully not.. > >> >> See recent discussions on "QOM path stability": >> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ > > If I read it right, the commit 46f7afa37096 example is pretty special that > the QOM path more or less decided more than the hierachy itself but changes > the existances of objects. Let's see whether I got this... We removed some useless objects, moved the useful ones to another home. The move changed their QOM path. The problem was the removal of useless objects, because this also removed their vmstate. The fix was adding the vmstate back as a dummy. The QOM patch changes are *not* part of the problem. Correct? >> > No one wants >> > to be policing QOM hierarchy changes in every single series that shows >> > up on the list. >> > >> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe >> > I can come up with some way to handle these cases. >> > >> > Hopefully between the analyze-migration test and the compat tests we'll >> > catch the next bug of this kind before it gets merged. > > Things like that might be able to be detected via vmstate-static-checker.py. > But I'm not 100% sure, also its coverage is limited. > > For example, I don't think it can detect changes to objects that will only > be created dynamically, e.g., I think sometimes we create objects after > some guest behaviors (consider guest enables the device, then QEMU > emulation creates some objects on demand of device setup?), Feels nuts to me. In real hardware, software enabling a device that is disabled by default doesn't create the device. The device is always there, it just happens to be inactive unless enabled. We should model the device just like that. > and since the > static checker shouldn't ever start the VM and run any code, they won't > trigger.
On Wed, Jan 10, 2024 at 07:03:06AM +0100, Markus Armbruster wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: > >> Hi Fabiano, > >> > >> On 9/1/24 21:21, Fabiano Rosas wrote: > >> > Cédric Le Goater <clg@kaod.org> writes: > >> > > >> > > On 1/9/24 18:40, Fabiano Rosas wrote: > >> > > > Cédric Le Goater <clg@kaod.org> writes: > >> > > > > >> > > > > On 1/3/24 20:53, Fabiano Rosas wrote: > >> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes: > >> > > > > > > >> > > > > > > +Peter/Fabiano > >> > > > > > > > >> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote: > >> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: > >> > > > > > > > > Hi Cédric, > >> > > > > > > > > > >> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote: > >> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: > >> > > > > > > > > > > Hi, > >> > > > > > > > > > > > >> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the > >> > > > > > > > > > > cluster container, not to the board/soc layer. This series move > >> > > > > > > > > > > the creation of vCPUs to the MPCore private container. > >> > > > > > > > > > > > >> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a > >> > > > > > > > > > > central place (abstract MPCore parent). > >> > > > > > > > > > > >> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine > >> > > > > > > > > > and some fixups are then required to maintain migration compatibility. > >> > > > > > > > > > This can become a real headache for KVM machines like virt for which > >> > > > > > > > > > migration compatibility is a feature, less for emulated ones. > >> > > > > > > > > > >> > > > > > > > > All changes are either moving properties (which are not migrated) > >> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which > >> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration > >> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :> > >> > > > > > > >> > > > > > FWIW, I didn't spot anything problematic either. > >> > > > > > > >> > > > > > I've ran this through my migration compatibility series [1] and it > >> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M > >> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't > >> > > > > > think we even support migration of anything non-KVM on arm. > >> > > > > > >> > > > > it happens we do. > >> > > > > > >> > > > > >> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like > >> > > > non-KVM-capable cpus, as in 32-bit. Nevermind. > >> > > > >> > > Theoretically, we should be able to migrate to a TCG guest. Well, this > >> > > worked in the past for PPC. When I was doing more KVM related changes, > >> > > this was very useful for dev. Also, some machines are partially emulated. > >> > > Anyhow I agree this is not a strong requirement and we often break it. > >> > > Let's focus on KVM only. > >> > > > >> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 > >> > > > > > >> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes. > >> > > > > Good. 
> >> > > > > > >> > > > > However, changing the QOM topology clearly breaks migration compat, > >> > > > > >> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed > >> > > > already, do you have a pointer to one of those cases were we broke > >> > > > migration > >> > > > >> > > Regarding pseries, migration compat broke because of 5bc8d26de20c > >> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which > >> > > is similar to the changes proposed by this series, it impacts the QOM > >> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 > >> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which > >> > > is quite an headache and this turned out to raise another problem some > >> > > months ago ... :/ That's why I sent [1] to prepare removal of old > >> > > machines and workarounds becoming a burden. > >> > > >> > This feels like something that could be handled by the vmstate code > >> > somehow. The state is there, just under a different path. > >> > >> What, the QOM path is used in migration? ... > > > > Hopefully not.. > > > >> > >> See recent discussions on "QOM path stability": > >> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ > >> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ > >> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ > > > > If I read it right, the commit 46f7afa37096 example is pretty special that > > the QOM path more or less decided more than the hierachy itself but changes > > the existances of objects. > > Let's see whether I got this... > > We removed some useless objects, moved the useful ones to another home. > The move changed their QOM path. > > The problem was the removal of useless objects, because this also > removed their vmstate. > > The fix was adding the vmstate back as a dummy. > > The QOM patch changes are *not* part of the problem. > > Correct? [I'd leave this to Cedric] > > >> > No one wants > >> > to be policing QOM hierarchy changes in every single series that shows > >> > up on the list. > >> > > >> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe > >> > I can come up with some way to handle these cases. > >> > > >> > Hopefully between the analyze-migration test and the compat tests we'll > >> > catch the next bug of this kind before it gets merged. > > > > Things like that might be able to be detected via vmstate-static-checker.py. > > But I'm not 100% sure, also its coverage is limited. > > > > For example, I don't think it can detect changes to objects that will only > > be created dynamically, e.g., I think sometimes we create objects after > > some guest behaviors (consider guest enables the device, then QEMU > > emulation creates some objects on demand of device setup?), > > Feels nuts to me. > > In real hardware, software enabling a device that is disabled by default > doesn't create the device. The device is always there, it just happens > to be inactive unless enabled. We should model the device just like > that. It doesn't need to be the device itself to be dynamically created, but some other sub-objects that do not require to exist until the device is enabled, or some specific function of that device is enabled. It is logically doable. Is the example Cedric provided looks like some case like this? I am not sure, that's also why I'm not sure the static checker would work here. But logically it seems possible, e.g. with migration VMSD needed() facilities. 
Consider a device that has a sub-function which requires a sub-object.
It may not need to migrate that object if the sub-feature is not even
enabled. If that object is very large, it might be wise to skip it when
possible, so as not to send chunks of junk during the VM downtime.

But then, on second thought, I do agree it's probably not sensible,
because even if the src may know whether the sub-object will be needed,
there's probably no good way for the dest QEMU to know. It can only
know in something like a post_load() hook, but logically that can
happen only after a full load of that device state, so it might already
be too late.

Thanks,
Peter Xu <peterx@redhat.com> writes: > On Wed, Jan 10, 2024 at 07:03:06AM +0100, Markus Armbruster wrote: >> Peter Xu <peterx@redhat.com> writes: >> >> > On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >> >> Hi Fabiano, >> >> >> >> On 9/1/24 21:21, Fabiano Rosas wrote: [...] >> >> > No one wants >> >> > to be policing QOM hierarchy changes in every single series that shows >> >> > up on the list. >> >> > >> >> > Anyway, thanks for the pointers. I'll study that code a bit more, maybe >> >> > I can come up with some way to handle these cases. >> >> > >> >> > Hopefully between the analyze-migration test and the compat tests we'll >> >> > catch the next bug of this kind before it gets merged. >> > >> > Things like that might be able to be detected via vmstate-static-checker.py. >> > But I'm not 100% sure, also its coverage is limited. >> > >> > For example, I don't think it can detect changes to objects that will only >> > be created dynamically, e.g., I think sometimes we create objects after >> > some guest behaviors (consider guest enables the device, then QEMU >> > emulation creates some objects on demand of device setup?), >> >> Feels nuts to me. >> >> In real hardware, software enabling a device that is disabled by default >> doesn't create the device. The device is always there, it just happens >> to be inactive unless enabled. We should model the device just like >> that. > > It doesn't need to be the device itself to be dynamically created, but some > other sub-objects that do not require to exist until the device is enabled, > or some specific function of that device is enabled. It is logically doable. > > Is the example Cedric provided looks like some case like this? I am not > sure, that's also why I'm not sure the static checker would work here. But > logically it seems possible, e.g. with migration VMSD needed() facilities. > Consider a device has a sub-function that requires a sub-object. It may > not need to migrate that object if that sub-feature is not even enabled. > If that object is very large, it might be wise to do so if possible to not > send chunks of junk during the VM downtime. > > But then after a 2nd thought I do agree it's probably not sensible, because > even if the src may know whether the sub-object will be needed, there's > probably no good way for the dest QEMU to know. It can only know in > something like a post_load() hook, but logically that can happen only after > a full load of that device state, so might already be too late. > > Thanks, If an object has state that needs to be migrated only sometimes, and that part of the state is large enough to bother, we can put it in an optional subsection, can't we? Destination: if present, take it. If absent, initialize to default. Source: send unless (known to be) in default state.
On Wed, Jan 10, 2024 at 09:09:51AM +0100, Markus Armbruster wrote:
> If an object has state that needs to be migrated only sometimes, and
> that part of the state is large enough to bother, we can put it in an
> optional subsection, can't we?
>
> Destination: if present, take it. If absent, initialize to default.
>
> Source: send unless (known to be) in default state.

Hmm.. correct. I think I mixed up VMSD's needed() hook with the
field_exists() hook of the fields; my apologies.

The trick is that VMSD subsections are more flexible, because
vmstate_subsection_load() has:

    while (qemu_peek_byte(f, 0) == QEMU_VM_SUBSECTION) {
        ...
        sub_vmsd = vmstate_get_subsection(vmsd->subsections, idstr);
        ...
    }

So it tolerates a subsection that doesn't exist, as well as subsections
arriving out of order.

The field_exists() hook is not as flexible, as it's implemented like
this in vmstate_load_state():

    while (field->name) {
        ...
        if (vmstate_field_exists(vmsd, field, opaque, version_id)) {
            ...
        }
        ...
        field++;
    }

So vmstate_field_exists() needs to be decidable on the destination even
before the main vmsd is loaded, and it must always return the same
value as on the source. The fields also have ordering requirements.

Then yes, subsections should work for dynamic objects.

Thanks,
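To make the above concrete, here is a minimal sketch of the subsection
pattern Markus and Peter describe. The device, its fields and the names
(DemoDeviceState, "demo-device", "demo-device/ext") are hypothetical,
invented for this example; only the VMState macros and the overall
VMStateDescription layout are the regular QEMU ones, and the exact
field qualifiers vary slightly between QEMU versions.

    /* Sketch only: hypothetical device with a guest-enabled extension. */
    #include "qemu/osdep.h"
    #include "hw/qdev-core.h"
    #include "migration/vmstate.h"

    typedef struct DemoDeviceState {
        DeviceState parent_obj;
        uint32_t ctrl;
        bool ext_enabled;         /* guest-controlled feature switch */
        uint32_t ext_table[256];  /* only meaningful when ext_enabled */
    } DemoDeviceState;

    /* Source side: send the subsection only when not in the default state. */
    static bool demo_ext_needed(void *opaque)
    {
        DemoDeviceState *s = opaque;

        return s->ext_enabled;
    }

    static const VMStateDescription vmstate_demo_ext = {
        .name = "demo-device/ext",
        .version_id = 1,
        .minimum_version_id = 1,
        .needed = demo_ext_needed,
        .fields = (VMStateField[]) {
            VMSTATE_UINT32_ARRAY(ext_table, DemoDeviceState, 256),
            VMSTATE_END_OF_LIST()
        }
    };

    static const VMStateDescription vmstate_demo = {
        .name = "demo-device",
        .version_id = 1,
        .minimum_version_id = 1,
        .fields = (VMStateField[]) {
            VMSTATE_UINT32(ctrl, DemoDeviceState),
            VMSTATE_BOOL(ext_enabled, DemoDeviceState),
            VMSTATE_END_OF_LIST()
        },
        /* Destination side: if the subsection is absent from the stream,
         * ext_table simply keeps its reset-time default contents. */
        .subsections = (const VMStateDescription * []) {
            &vmstate_demo_ext,
            NULL
        }
    };

Unlike a field_exists() test, nothing here has to be decided on the
destination before the section is parsed: the subsection is looked up
by name when, and only when, it appears in the stream.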
Markus Armbruster <armbru@redhat.com> writes: > Peter Xu <peterx@redhat.com> writes: > >> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>> Hi Fabiano, >>> >>> On 9/1/24 21:21, Fabiano Rosas wrote: >>> > Cédric Le Goater <clg@kaod.org> writes: >>> > >>> > > On 1/9/24 18:40, Fabiano Rosas wrote: >>> > > > Cédric Le Goater <clg@kaod.org> writes: >>> > > > >>> > > > > On 1/3/24 20:53, Fabiano Rosas wrote: >>> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>> > > > > > >>> > > > > > > +Peter/Fabiano >>> > > > > > > >>> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote: >>> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>> > > > > > > > > Hi Cédric, >>> > > > > > > > > >>> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote: >>> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>> > > > > > > > > > > Hi, >>> > > > > > > > > > > >>> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the >>> > > > > > > > > > > cluster container, not to the board/soc layer. This series move >>> > > > > > > > > > > the creation of vCPUs to the MPCore private container. >>> > > > > > > > > > > >>> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a >>> > > > > > > > > > > central place (abstract MPCore parent). >>> > > > > > > > > > >>> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine >>> > > > > > > > > > and some fixups are then required to maintain migration compatibility. >>> > > > > > > > > > This can become a real headache for KVM machines like virt for which >>> > > > > > > > > > migration compatibility is a feature, less for emulated ones. >>> > > > > > > > > >>> > > > > > > > > All changes are either moving properties (which are not migrated) >>> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration >>> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :> >>> > > > > > >>> > > > > > FWIW, I didn't spot anything problematic either. >>> > > > > > >>> > > > > > I've ran this through my migration compatibility series [1] and it >>> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>> > > > > > think we even support migration of anything non-KVM on arm. >>> > > > > >>> > > > > it happens we do. >>> > > > > >>> > > > >>> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>> > > > non-KVM-capable cpus, as in 32-bit. Nevermind. >>> > > >>> > > Theoretically, we should be able to migrate to a TCG guest. Well, this >>> > > worked in the past for PPC. When I was doing more KVM related changes, >>> > > this was very useful for dev. Also, some machines are partially emulated. >>> > > Anyhow I agree this is not a strong requirement and we often break it. >>> > > Let's focus on KVM only. >>> > > >>> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>> > > > > >>> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes. >>> > > > > Good. 
>>> > > > > >>> > > > > However, changing the QOM topology clearly breaks migration compat, >>> > > > >>> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed >>> > > > already, do you have a pointer to one of those cases were we broke >>> > > > migration >>> > > >>> > > Regarding pseries, migration compat broke because of 5bc8d26de20c >>> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>> > > is similar to the changes proposed by this series, it impacts the QOM >>> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >>> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which >>> > > is quite an headache and this turned out to raise another problem some >>> > > months ago ... :/ That's why I sent [1] to prepare removal of old >>> > > machines and workarounds becoming a burden. >>> > >>> > This feels like something that could be handled by the vmstate code >>> > somehow. The state is there, just under a different path. >>> >>> What, the QOM path is used in migration? ... >> >> Hopefully not.. Unfortunately the original fix doesn't mention _what_ actually broke with migration. I assumed the QOM path was needed because otherwise I don't think the fix makes sense. The thread discussing that patch also directly mentions the QOM path: https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html But I probably misunderstood something while reading that thread. >> >>> >>> See recent discussions on "QOM path stability": >>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ >> >> If I read it right, the commit 46f7afa37096 example is pretty special that >> the QOM path more or less decided more than the hierachy itself but changes >> the existances of objects. > > Let's see whether I got this... > > We removed some useless objects, moved the useful ones to another home. > The move changed their QOM path. > > The problem was the removal of useless objects, because this also > removed their vmstate. If you checkout at the removal commit (5bc8d26de20c), the vmstate has been kept untouched. > > The fix was adding the vmstate back as a dummy. Since the vmstate was kept I don't see why would we need a dummy. The incoming migration stream would still have the state, only at a different point in the stream. It's surprising to me that that would cause an issue, but I'm not well versed in that code. > > The QOM patch changes are *not* part of the problem. The only explanation I can come up with is that after the patch migration has broken after a hotplug or similar operation. In such situation, the preallocated state would always be present before the patch, but sometimes not present after the patch in case, say, a hot-unplug has taken away a cpu + ICPState.
Fabiano Rosas <farosas@suse.de> writes: > Markus Armbruster <armbru@redhat.com> writes: > >> Peter Xu <peterx@redhat.com> writes: >> >>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>>> Hi Fabiano, >>>> >>>> On 9/1/24 21:21, Fabiano Rosas wrote: >>>> > Cédric Le Goater <clg@kaod.org> writes: >>>> > >>>> > > On 1/9/24 18:40, Fabiano Rosas wrote: >>>> > > > Cédric Le Goater <clg@kaod.org> writes: >>>> > > > >>>> > > > > On 1/3/24 20:53, Fabiano Rosas wrote: >>>> > > > > > Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>> > > > > > >>>> > > > > > > +Peter/Fabiano >>>> > > > > > > >>>> > > > > > > On 2/1/24 17:41, Cédric Le Goater wrote: >>>> > > > > > > > On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>> > > > > > > > > Hi Cédric, >>>> > > > > > > > > >>>> > > > > > > > > On 2/1/24 15:55, Cédric Le Goater wrote: >>>> > > > > > > > > > On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>> > > > > > > > > > > Hi, >>>> > > > > > > > > > > >>>> > > > > > > > > > > When a MPCore cluster is used, the Cortex-A cores belong the the >>>> > > > > > > > > > > cluster container, not to the board/soc layer. This series move >>>> > > > > > > > > > > the creation of vCPUs to the MPCore private container. >>>> > > > > > > > > > > >>>> > > > > > > > > > > Doing so we consolidate the QOM model, moving common code in a >>>> > > > > > > > > > > central place (abstract MPCore parent). >>>> > > > > > > > > > >>>> > > > > > > > > > Changing the QOM hierarchy has an impact on the state of the machine >>>> > > > > > > > > > and some fixups are then required to maintain migration compatibility. >>>> > > > > > > > > > This can become a real headache for KVM machines like virt for which >>>> > > > > > > > > > migration compatibility is a feature, less for emulated ones. >>>> > > > > > > > > >>>> > > > > > > > > All changes are either moving properties (which are not migrated) >>>> > > > > > > > > or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>> > > > > > > > > is still migrated elsewhere). So I don't see any obvious migration >>>> > > > > > > > > problem, but I might be missing something, so I Cc'ed Juan :> >>>> > > > > > >>>> > > > > > FWIW, I didn't spot anything problematic either. >>>> > > > > > >>>> > > > > > I've ran this through my migration compatibility series [1] and it >>>> > > > > > doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>> > > > > > virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>> > > > > > think we even support migration of anything non-KVM on arm. >>>> > > > > >>>> > > > > it happens we do. >>>> > > > > >>>> > > > >>>> > > > Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>> > > > non-KVM-capable cpus, as in 32-bit. Nevermind. >>>> > > >>>> > > Theoretically, we should be able to migrate to a TCG guest. Well, this >>>> > > worked in the past for PPC. When I was doing more KVM related changes, >>>> > > this was very useful for dev. Also, some machines are partially emulated. >>>> > > Anyhow I agree this is not a strong requirement and we often break it. >>>> > > Let's focus on KVM only. >>>> > > >>>> > > > > > 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>> > > > > >>>> > > > > yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>> > > > > Good. 
>>>> > > > > >>>> > > > > However, changing the QOM topology clearly breaks migration compat, >>>> > > > >>>> > > > Well, "clearly" is relative =) You've mentioned pseries and aspeed >>>> > > > already, do you have a pointer to one of those cases were we broke >>>> > > > migration >>>> > > >>>> > > Regarding pseries, migration compat broke because of 5bc8d26de20c >>>> > > ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>>> > > is similar to the changes proposed by this series, it impacts the QOM >>>> > > hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >>>> > > ("spapr: fix migration of ICPState objects from/to older QEMU") which >>>> > > is quite an headache and this turned out to raise another problem some >>>> > > months ago ... :/ That's why I sent [1] to prepare removal of old >>>> > > machines and workarounds becoming a burden. >>>> > >>>> > This feels like something that could be handled by the vmstate code >>>> > somehow. The state is there, just under a different path. >>>> >>>> What, the QOM path is used in migration? ... >>> >>> Hopefully not.. > > Unfortunately the original fix doesn't mention _what_ actually broke > with migration. I assumed the QOM path was needed because otherwise I > don't think the fix makes sense. The thread discussing that patch also > directly mentions the QOM path: > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html > > But I probably misunderstood something while reading that thread. > >>> >>>> >>>> See recent discussions on "QOM path stability": >>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ >>> >>> If I read it right, the commit 46f7afa37096 example is pretty special that >>> the QOM path more or less decided more than the hierachy itself but changes >>> the existances of objects. >> >> Let's see whether I got this... >> >> We removed some useless objects, moved the useful ones to another home. >> The move changed their QOM path. >> >> The problem was the removal of useless objects, because this also >> removed their vmstate. > > If you checkout at the removal commit (5bc8d26de20c), the vmstate has > been kept untouched. > >> >> The fix was adding the vmstate back as a dummy. > > Since the vmstate was kept I don't see why would we need a dummy. The > incoming migration stream would still have the state, only at a > different point in the stream. It's surprising to me that that would > cause an issue, but I'm not well versed in that code. Alright, I understand neither the problem nor the fix :) >> The QOM patch changes are *not* part of the problem. > > The only explanation I can come up with is that after the patch > migration has broken after a hotplug or similar operation. In such > situation, the preallocated state would always be present before the > patch, but sometimes not present after the patch in case, say, a > hot-unplug has taken away a cpu + ICPState. My head hurts... Oh, we're talking migration! Perfectly normal then.
>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>> is similar to the changes proposed by this series, it impacts the QOM
>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>> is quite an headache and this turned out to raise another problem some
>> months ago ... :/ That's why I sent [1] to prepare removal of old
>> machines and workarounds becoming a burden.
>>
>> Regarding aspeed, this series breaks compat.
>
> Can you write down the steps to reproduce please? I'll debug it.
> We need to understand this.

Nothing complex,

$ wget https://github.com/legoater/qemu-aspeed-boot/raw/master/images/ast2600-evb/buildroot-2023.11/flash.img

$ qemu-system-arm -M ast2600-evb -net user -drive file=./flash.img,format=raw,if=mtd -nographic -snapshot -serial mon:stdio -trace vmstate* -trace save* -trace load*

$ qemu-system-arm-patch -M ast2600-evb -net user -drive file=./flash.img,format=raw,if=mtd -nographic -snapshot -serial mon:stdio -trace vmstate* -trace save* -trace load* -incoming tcp::1234

stop the VM in U-boot before loading the kernel because QEMU does not
support migrating CPU when in secure mode. That's how I understood what
Peter told me.

(qemu) migrate tcp:localhost:1234
...
vmstate_load_state_field cpu:cpreg_vmstate_array_len
vmstate_n_elems cpreg_vmstate_array_len: 1
qemu-system-arm: Invalid value 266 expecting positive value <= 223
qemu-system-arm: Failed to load cpu:cpreg_vmstate_array_len
vmstate_load_field_error field "cpreg_vmstate_array_len" load failed, ret = -22
qemu-system-arm: error while loading state for instance 0x0 of device 'cpu'

Thanks,

C.
On 1/9/24 22:09, Philippe Mathieu-Daudé wrote: > On 9/1/24 22:07, Philippe Mathieu-Daudé wrote: >> Hi Cédric, >> >> On 9/1/24 19:06, Cédric Le Goater wrote: >>> On 1/9/24 18:40, Fabiano Rosas wrote: >>>> Cédric Le Goater <clg@kaod.org> writes: >>>> >>>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>>> >>>>>>> +Peter/Fabiano >>>>>>> >>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>>> Hi Cédric, >>>>>>>>> >>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>>> >>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>>> >>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>>> >>>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>>> >>>>>> FWIW, I didn't spot anything problematic either. >>>>>> >>>>>> I've ran this through my migration compatibility series [1] and it >>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>>> think we even support migration of anything non-KVM on arm. >>>>> >>>>> it happens we do. >>>>> >>>> >>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>> non-KVM-capable cpus, as in 32-bit. Nevermind. >>> >>> Theoretically, we should be able to migrate to a TCG guest. Well, this >>> worked in the past for PPC. When I was doing more KVM related changes, >>> this was very useful for dev. Also, some machines are partially emulated. >>> Anyhow I agree this is not a strong requirement and we often break it. >>> Let's focus on KVM only. >> >> No no, we want the same for TCG. >> >>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>>> >>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>>> Good. >>>>> >>>>> However, changing the QOM topology clearly breaks migration compat, >>>> >>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>>> already, do you have a pointer to one of those cases were we broke >>>> migration >>> >>> Regarding pseries, migration compat broke because of 5bc8d26de20c >>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>> is similar to the changes proposed by this series, it impacts the QOM >>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >>> ("spapr: fix migration of ICPState objects from/to older QEMU") which >>> is quite an headache and this turned out to raise another problem some >>> months ago ... :/ That's why I sent [1] to prepare removal of old >>> machines and workarounds becoming a burden. 
>>> >>> Regarding aspeed, this series breaks compat. >> >> Can you write down the steps to reproduce please? I'll debug it. > > Also, have you figured (bisecting) which patch start to break? Sorry I didn't. I wished I had more time for that. Thanks, C.
On 1/10/24 07:03, Markus Armbruster wrote: > Peter Xu <peterx@redhat.com> writes: > >> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>> Hi Fabiano, >>> >>> On 9/1/24 21:21, Fabiano Rosas wrote: >>>> Cédric Le Goater <clg@kaod.org> writes: >>>> >>>>> On 1/9/24 18:40, Fabiano Rosas wrote: >>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>> >>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>>>>> >>>>>>>>> +Peter/Fabiano >>>>>>>>> >>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>>>>> Hi Cédric, >>>>>>>>>>> >>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>>>>> >>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>>>>> >>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>>>>> >>>>>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>>>>> >>>>>>>> FWIW, I didn't spot anything problematic either. >>>>>>>> >>>>>>>> I've ran this through my migration compatibility series [1] and it >>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>>>>> think we even support migration of anything non-KVM on arm. >>>>>>> >>>>>>> it happens we do. >>>>>>> >>>>>> >>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind. >>>>> >>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this >>>>> worked in the past for PPC. When I was doing more KVM related changes, >>>>> this was very useful for dev. Also, some machines are partially emulated. >>>>> Anyhow I agree this is not a strong requirement and we often break it. >>>>> Let's focus on KVM only. >>>>> >>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>>>>> >>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>>>>> Good. >>>>>>> >>>>>>> However, changing the QOM topology clearly breaks migration compat, >>>>>> >>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>>>>> already, do you have a pointer to one of those cases were we broke >>>>>> migration >>>>> >>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c >>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>>>> is similar to the changes proposed by this series, it impacts the QOM >>>>> hierarchy. 
Here is the workaround/fix from Greg : 46f7afa37096 >>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which >>>>> is quite an headache and this turned out to raise another problem some >>>>> months ago ... :/ That's why I sent [1] to prepare removal of old >>>>> machines and workarounds becoming a burden. >>>> >>>> This feels like something that could be handled by the vmstate code >>>> somehow. The state is there, just under a different path. >>> >>> What, the QOM path is used in migration? ... >> >> Hopefully not.. >> >>> >>> See recent discussions on "QOM path stability": >>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ >> >> If I read it right, the commit 46f7afa37096 example is pretty special that >> the QOM path more or less decided more than the hierachy itself but changes >> the existances of objects. > > Let's see whether I got this... > > We removed some useless objects, moved the useful ones to another home. > The move changed their QOM path. They interrupt controller presenter objects were quite useful :) From what I recall, we moved them from an array under the machine to the CPU object, so the interrupt controller presenter states previously under the machine were not there anymore and this broke migration compatibility. Sorry for the noise if this is not a problem anymore. It was at the time and we found a way to work around it; I should probably say we hacked our way around it. Nevertheless, this was kind of a trauma too because since I never dared touch the QOM hierarchy of a migratable machine again. Migration is complicated. > The problem was the removal of useless objects, because this also > removed their vmstate. > > The fix was adding the vmstate back as a dummy. > > The QOM patch changes are *not* part of the problem. > > Correct? > >>>> No one wants >>>> to be policing QOM hierarchy changes in every single series that shows >>>> up on the list. >>>> >>>> Anyway, thanks for the pointers. I'll study that code a bit more, maybe >>>> I can come up with some way to handle these cases. >>>> >>>> Hopefully between the analyze-migration test and the compat tests we'll >>>> catch the next bug of this kind before it gets merged. >> >> Things like that might be able to be detected via vmstate-static-checker.py. >> But I'm not 100% sure, also its coverage is limited. >> >> For example, I don't think it can detect changes to objects that will only >> be created dynamically, e.g., I think sometimes we create objects after >> some guest behaviors (consider guest enables the device, then QEMU >> emulation creates some objects on demand of device setup?), > > Feels nuts to me. > > In real hardware, software enabling a device that is disabled by default > doesn't create the device. The device is always there, it just happens > to be inactive unless enabled. We should model the device just like > that. yes. That's how we modeled the two interrupt modes in pseries. The machine has two interrupt controller devices model always present and the cpus, two interrupt presenters. SW negotiates with the platform (QEMU) which mode to activate. This is the only way to support migration with an OS that can choose such complex features. 
For the context, POWER9 introduced a new flavor of HW logic for interrupts, which scaled better on large system (16s) and guests with newer OS could dynamically switch the SW interface to choose the new implementation. Thanks, C. > >> and since the >> static checker shouldn't ever start the VM and run any code, they won't >> trigger. >
On 1/10/24 07:26, Peter Xu wrote: > On Wed, Jan 10, 2024 at 07:03:06AM +0100, Markus Armbruster wrote: >> Peter Xu <peterx@redhat.com> writes: >> >>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>>> Hi Fabiano, >>>> >>>> On 9/1/24 21:21, Fabiano Rosas wrote: >>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>> >>>>>> On 1/9/24 18:40, Fabiano Rosas wrote: >>>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>>> >>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>>>>>> >>>>>>>>>> +Peter/Fabiano >>>>>>>>>> >>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>> Hi Cédric, >>>>>>>>>>>> >>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>>>>>> >>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>>>>>> >>>>>>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>>>>>> >>>>>>>>> FWIW, I didn't spot anything problematic either. >>>>>>>>> >>>>>>>>> I've ran this through my migration compatibility series [1] and it >>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>>>>>> think we even support migration of anything non-KVM on arm. >>>>>>>> >>>>>>>> it happens we do. >>>>>>>> >>>>>>> >>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind. >>>>>> >>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this >>>>>> worked in the past for PPC. When I was doing more KVM related changes, >>>>>> this was very useful for dev. Also, some machines are partially emulated. >>>>>> Anyhow I agree this is not a strong requirement and we often break it. >>>>>> Let's focus on KVM only. >>>>>> >>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>>>>>> >>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>>>>>> Good. 
>>>>>>>> >>>>>>>> However, changing the QOM topology clearly breaks migration compat, >>>>>>> >>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>>>>>> already, do you have a pointer to one of those cases were we broke >>>>>>> migration >>>>>> >>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c >>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>>>>> is similar to the changes proposed by this series, it impacts the QOM >>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which >>>>>> is quite an headache and this turned out to raise another problem some >>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old >>>>>> machines and workarounds becoming a burden. >>>>> >>>>> This feels like something that could be handled by the vmstate code >>>>> somehow. The state is there, just under a different path. >>>> >>>> What, the QOM path is used in migration? ... >>> >>> Hopefully not.. >>> >>>> >>>> See recent discussions on "QOM path stability": >>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ >>> >>> If I read it right, the commit 46f7afa37096 example is pretty special that >>> the QOM path more or less decided more than the hierachy itself but changes >>> the existances of objects. >> >> Let's see whether I got this... >> >> We removed some useless objects, moved the useful ones to another home. >> The move changed their QOM path. >> >> The problem was the removal of useless objects, because this also >> removed their vmstate. >> >> The fix was adding the vmstate back as a dummy. >> >> The QOM patch changes are *not* part of the problem. >> >> Correct? > > [I'd leave this to Cedric] > >> >>>>> No one wants >>>>> to be policing QOM hierarchy changes in every single series that shows >>>>> up on the list. >>>>> >>>>> Anyway, thanks for the pointers. I'll study that code a bit more, maybe >>>>> I can come up with some way to handle these cases. >>>>> >>>>> Hopefully between the analyze-migration test and the compat tests we'll >>>>> catch the next bug of this kind before it gets merged. >>> >>> Things like that might be able to be detected via vmstate-static-checker.py. >>> But I'm not 100% sure, also its coverage is limited. >>> >>> For example, I don't think it can detect changes to objects that will only >>> be created dynamically, e.g., I think sometimes we create objects after >>> some guest behaviors (consider guest enables the device, then QEMU >>> emulation creates some objects on demand of device setup?), >> >> Feels nuts to me. >> >> In real hardware, software enabling a device that is disabled by default >> doesn't create the device. The device is always there, it just happens >> to be inactive unless enabled. We should model the device just like >> that. > > It doesn't need to be the device itself to be dynamically created, but some > other sub-objects that do not require to exist until the device is enabled, > or some specific function of that device is enabled. It is logically doable. > > Is the example Cedric provided looks like some case like this? I am not > sure, that's also why I'm not sure the static checker would work here. But > logically it seems possible, e.g. with migration VMSD needed() facilities. 
> Consider a device has a sub-function that requires a sub-object. It may > not need to migrate that object if that sub-feature is not even enabled. > If that object is very large, it might be wise to do so if possible to not > send chunks of junk during the VM downtime. > > But then after a 2nd thought I do agree it's probably not sensible, because > even if the src may know whether the sub-object will be needed, there's > probably no good way for the dest QEMU to know. It can only know in > something like a post_load() hook, but logically that can happen only after > a full load of that device state, so might already be too late. If features can be dynamically enabled by the OS, after negotiation with the platform, the source should prepare for all possible scenarios and migrate all models supported for a given configuration. The OS could choose to enable a feature on the target which was not enabled on the source before migration. Thanks, C.
On 1/10/24 14:19, Fabiano Rosas wrote: > Markus Armbruster <armbru@redhat.com> writes: > >> Peter Xu <peterx@redhat.com> writes: >> >>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>>> Hi Fabiano, >>>> >>>> On 9/1/24 21:21, Fabiano Rosas wrote: >>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>> >>>>>> On 1/9/24 18:40, Fabiano Rosas wrote: >>>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>>> >>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>>>>>> >>>>>>>>>> +Peter/Fabiano >>>>>>>>>> >>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>> Hi Cédric, >>>>>>>>>>>> >>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>>>>>> >>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>>>>>> >>>>>>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>>>>>> >>>>>>>>> FWIW, I didn't spot anything problematic either. >>>>>>>>> >>>>>>>>> I've ran this through my migration compatibility series [1] and it >>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>>>>>> think we even support migration of anything non-KVM on arm. >>>>>>>> >>>>>>>> it happens we do. >>>>>>>> >>>>>>> >>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind. >>>>>> >>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this >>>>>> worked in the past for PPC. When I was doing more KVM related changes, >>>>>> this was very useful for dev. Also, some machines are partially emulated. >>>>>> Anyhow I agree this is not a strong requirement and we often break it. >>>>>> Let's focus on KVM only. >>>>>> >>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>>>>>> >>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>>>>>> Good. >>>>>>>> >>>>>>>> However, changing the QOM topology clearly breaks migration compat, >>>>>>> >>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>>>>>> already, do you have a pointer to one of those cases were we broke >>>>>>> migration >>>>>> >>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c >>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>>>>> is similar to the changes proposed by this series, it impacts the QOM >>>>>> hierarchy. 
Here is the workaround/fix from Greg : 46f7afa37096 >>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which >>>>>> is quite an headache and this turned out to raise another problem some >>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old >>>>>> machines and workarounds becoming a burden. >>>>> >>>>> This feels like something that could be handled by the vmstate code >>>>> somehow. The state is there, just under a different path. >>>> >>>> What, the QOM path is used in migration? ... >>> >>> Hopefully not.. > > Unfortunately the original fix doesn't mention _what_ actually broke > with migration. I assumed the QOM path was needed because otherwise I > don't think the fix makes sense. The thread discussing that patch also > directly mentions the QOM path: > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html > > But I probably misunderstood something while reading that thread. > >>> >>>> >>>> See recent discussions on "QOM path stability": >>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ >>> >>> If I read it right, the commit 46f7afa37096 example is pretty special that >>> the QOM path more or less decided more than the hierachy itself but changes >>> the existances of objects. >> >> Let's see whether I got this... >> >> We removed some useless objects, moved the useful ones to another home. >> The move changed their QOM path. >> >> The problem was the removal of useless objects, because this also >> removed their vmstate. > > If you checkout at the removal commit (5bc8d26de20c), the vmstate has > been kept untouched. > >> >> The fix was adding the vmstate back as a dummy. > > Since the vmstate was kept I don't see why would we need a dummy. The > incoming migration stream would still have the state, only at a > different point in the stream. It's surprising to me that that would > cause an issue, but I'm not well versed in that code. > >> >> The QOM patch changes are *not* part of the problem. > > The only explanation I can come up with is that after the patch > migration has broken after a hotplug or similar operation. In such > situation, the preallocated state would always be present before the > patch, but sometimes not present after the patch in case, say, a > hot-unplug has taken away a cpu + ICPState. May be. It has been a while. For a better understanding, I tried to reproduce. QEMU 2.9 still compiles on a f38 (memfd_create needs a fix). However, a QEMU-2.9/pseries-2.9 machine won't start any recent OS because something broke the OS/platform negotiation process (CAS) ... I removed the pre_2_10_has_unused_icps hack and worked around another migration compat issue in IOMMU ... 
Here are the last trace-events for migration :

QEMU-mainline/pseries-2.9 -> QEMU-2.9/pseries-2.9

1016464@1705053752.106417:vmstate_load spapr, spapr
1016464@1705053752.106419:vmstate_load_state spapr v3
1016464@1705053752.106421:vmstate_load_state_field spapr:unused
1016464@1705053752.106424:vmstate_load_state_field spapr:rtc_offset
1016464@1705053752.106425:vmstate_load_state_field spapr:tb
1016464@1705053752.106427:vmstate_n_elems tb: 1
1016464@1705053752.106429:vmstate_load_state timebase v1
1016464@1705053752.106432:vmstate_load_state_field timebase:guest_timebase
1016464@1705053752.106433:vmstate_n_elems guest_timebase: 1
1016464@1705053752.106435:vmstate_load_state_field timebase:time_of_the_day_ns
1016464@1705053752.106436:vmstate_n_elems time_of_the_day_ns: 1
1016464@1705053752.106438:vmstate_subsection_load timebase
1016464@1705053752.106440:vmstate_subsection_load_bad timebase: spapr_option_vector_ov5_cas/(prefix)
1016464@1705053752.106442:vmstate_load_state_end timebase end/0
1016464@1705053752.106443:vmstate_subsection_load spapr
1016464@1705053752.106445:vmstate_load_state spapr_option_vector_ov5_cas v1
1016464@1705053752.106447:vmstate_load_state_field spapr_option_vector_ov5_cas:ov5_cas
1016464@1705053752.106448:vmstate_n_elems ov5_cas: 1
1016464@1705053752.106450:vmstate_load_state spapr_option_vector v1
1016464@1705053752.106452:vmstate_load_state_field spapr_option_vector:bitmap
1016464@1705053752.106454:vmstate_n_elems bitmap: 1
1016464@1705053752.106456:vmstate_subsection_load spapr_option_vector
1016464@1705053752.106457:vmstate_subsection_load_bad spapr_option_vector: (short)/
1016464@1705053752.106459:vmstate_load_state_end spapr_option_vector end/0
1016464@1705053752.106461:vmstate_subsection_load spapr_option_vector_ov5_cas
1016464@1705053752.106462:vmstate_subsection_load_bad spapr_option_vector_ov5_cas: (short)/
1016464@1705053752.106465:vmstate_load_state_end spapr_option_vector_ov5_cas end/0
1016464@1705053752.106466:vmstate_subsection_load_bad spapr: spapr/cap/htm/(lookup)

qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr'

Hope it helps,

Thanks,

C.
Cédric Le Goater <clg@kaod.org> writes: > On 1/10/24 14:19, Fabiano Rosas wrote: >> Markus Armbruster <armbru@redhat.com> writes: >> >>> Peter Xu <peterx@redhat.com> writes: >>> >>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>>>> Hi Fabiano, >>>>> >>>>> On 9/1/24 21:21, Fabiano Rosas wrote: >>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>> >>>>>>> On 1/9/24 18:40, Fabiano Rosas wrote: >>>>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>>>> >>>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>>>>>>> >>>>>>>>>>> +Peter/Fabiano >>>>>>>>>>> >>>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>> Hi Cédric, >>>>>>>>>>>>> >>>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>>>>>>> >>>>>>>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>>>>>>> >>>>>>>>>> FWIW, I didn't spot anything problematic either. >>>>>>>>>> >>>>>>>>>> I've ran this through my migration compatibility series [1] and it >>>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>>>>>>> think we even support migration of anything non-KVM on arm. >>>>>>>>> >>>>>>>>> it happens we do. >>>>>>>>> >>>>>>>> >>>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind. >>>>>>> >>>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this >>>>>>> worked in the past for PPC. When I was doing more KVM related changes, >>>>>>> this was very useful for dev. Also, some machines are partially emulated. >>>>>>> Anyhow I agree this is not a strong requirement and we often break it. >>>>>>> Let's focus on KVM only. >>>>>>> >>>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>>>>>>> >>>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>>>>>>> Good. 
>>>>>>>>> >>>>>>>>> However, changing the QOM topology clearly breaks migration compat, >>>>>>>> >>>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed >>>>>>>> already, do you have a pointer to one of those cases were we broke >>>>>>>> migration >>>>>>> >>>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c >>>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which >>>>>>> is similar to the changes proposed by this series, it impacts the QOM >>>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096 >>>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which >>>>>>> is quite an headache and this turned out to raise another problem some >>>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old >>>>>>> machines and workarounds becoming a burden. >>>>>> >>>>>> This feels like something that could be handled by the vmstate code >>>>>> somehow. The state is there, just under a different path. >>>>> >>>>> What, the QOM path is used in migration? ... >>>> >>>> Hopefully not.. >> >> Unfortunately the original fix doesn't mention _what_ actually broke >> with migration. I assumed the QOM path was needed because otherwise I >> don't think the fix makes sense. The thread discussing that patch also >> directly mentions the QOM path: >> >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html >> >> But I probably misunderstood something while reading that thread. >> >>>> >>>>> >>>>> See recent discussions on "QOM path stability": >>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/ >>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/ >>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/ >>>> >>>> If I read it right, the commit 46f7afa37096 example is pretty special that >>>> the QOM path more or less decided more than the hierachy itself but changes >>>> the existances of objects. >>> >>> Let's see whether I got this... >>> >>> We removed some useless objects, moved the useful ones to another home. >>> The move changed their QOM path. >>> >>> The problem was the removal of useless objects, because this also >>> removed their vmstate. >> >> If you checkout at the removal commit (5bc8d26de20c), the vmstate has >> been kept untouched. >> >>> >>> The fix was adding the vmstate back as a dummy. >> >> Since the vmstate was kept I don't see why would we need a dummy. The >> incoming migration stream would still have the state, only at a >> different point in the stream. It's surprising to me that that would >> cause an issue, but I'm not well versed in that code. >> >>> >>> The QOM patch changes are *not* part of the problem. >> >> The only explanation I can come up with is that after the patch >> migration has broken after a hotplug or similar operation. In such >> situation, the preallocated state would always be present before the >> patch, but sometimes not present after the patch in case, say, a >> hot-unplug has taken away a cpu + ICPState. > > May be. It has been a while. > > For a better understanding, I tried to reproduce. QEMU 2.9 still > compiles on a f38 (memfd_create needs a fix). However, a > QEMU-2.9/pseries-2.9 machine won't start any recent OS because > something broke the OS/platform negotiation process (CAS) ... Thanks for letting us know it still builds. That motivated me to try as well. Although it took me a while because of python's brokenness. The issue was indeed the difference in preallocation vs. no preallocation. 
I don't remember how to do hotplug in ppc anymore, but maxcpus was sufficient to reproduce: src 2.9.0: $ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -nographic -vga none -display none -accel tcg -machine pseries -smp 4,maxcpus=8 -m 256M -nodefaults dst 2.9.0 prior to 46f7afa: $ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -display none -accel tcg -machine pseries-2.9 -smp 4,maxcpus=8 -m 256M -nodefaults -incoming defer (qemu) migrate_incoming tcp:localhost:1234 (qemu) qemu-system-ppc64: Unknown savevm section or instance 'icp/server' 4 qemu-system-ppc64: load of migration failed: Invalid argument As you can see, both use 4 cpus, but the src will try to migrate all 8 icp/server instances. After the patch, the same migration works. So this was not related to the QOM path nor the QOM hierarchy, it was just about having state created at machine init vs. cpu realize time. It might be wise to keep an eye on the QOM hierarchy anyway as it could make these kinds of changes more evident during review. > I removed the pre_2_10_has_unused_icps hack and worked around > another migration compat issue in IOMMU ... Here are the last > trace-events for migration : > > QEMU-mainline/pseries-2.9 -> QEMU-2.9/pseries-2.9 > > 1016464@1705053752.106417:vmstate_load spapr, spapr > 1016464@1705053752.106419:vmstate_load_state spapr v3 > 1016464@1705053752.106421:vmstate_load_state_field spapr:unused > 1016464@1705053752.106424:vmstate_load_state_field spapr:rtc_offset > 1016464@1705053752.106425:vmstate_load_state_field spapr:tb > 1016464@1705053752.106427:vmstate_n_elems tb: 1 > 1016464@1705053752.106429:vmstate_load_state timebase v1 > 1016464@1705053752.106432:vmstate_load_state_field timebase:guest_timebase > 1016464@1705053752.106433:vmstate_n_elems guest_timebase: 1 > 1016464@1705053752.106435:vmstate_load_state_field timebase:time_of_the_day_ns > 1016464@1705053752.106436:vmstate_n_elems time_of_the_day_ns: 1 > 1016464@1705053752.106438:vmstate_subsection_load timebase > 1016464@1705053752.106440:vmstate_subsection_load_bad timebase: spapr_option_vector_ov5_cas/(prefix) > 1016464@1705053752.106442:vmstate_load_state_end timebase end/0 > 1016464@1705053752.106443:vmstate_subsection_load spapr > 1016464@1705053752.106445:vmstate_load_state spapr_option_vector_ov5_cas v1 > 1016464@1705053752.106447:vmstate_load_state_field spapr_option_vector_ov5_cas:ov5_cas > 1016464@1705053752.106448:vmstate_n_elems ov5_cas: 1 > 1016464@1705053752.106450:vmstate_load_state spapr_option_vector v1 > 1016464@1705053752.106452:vmstate_load_state_field spapr_option_vector:bitmap > 1016464@1705053752.106454:vmstate_n_elems bitmap: 1 > 1016464@1705053752.106456:vmstate_subsection_load spapr_option_vector > 1016464@1705053752.106457:vmstate_subsection_load_bad spapr_option_vector: (short)/ > 1016464@1705053752.106459:vmstate_load_state_end spapr_option_vector end/0 > 1016464@1705053752.106461:vmstate_subsection_load spapr_option_vector_ov5_cas > 1016464@1705053752.106462:vmstate_subsection_load_bad spapr_option_vector_ov5_cas: (short)/ > 1016464@1705053752.106465:vmstate_load_state_end spapr_option_vector_ov5_cas end/0 > 1016464@1705053752.106466:vmstate_subsection_load_bad spapr: spapr/cap/htm/(lookup) > qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr' This trace this seems to be something else entirely. I'd be amazed if migrating mainline vs. something that old worked at all, even on x86.
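For readers less familiar with that distinction, here is a rough sketch
of the two allocation styles. The names (TYPE_DEMO_ICP, DemoMachineState,
old_style_init/new_style_init) are hypothetical; this is not the actual
spapr code, only an illustration of why the number of registered vmstate
instances changes with -smp 4,maxcpus=8.

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "hw/qdev-core.h"

    #define TYPE_DEMO_ICP "demo-icp"   /* hypothetical presenter device type */

    typedef struct DemoMachineState {
        unsigned smp_cpus;        /* CPUs present at boot: 4 in the repro  */
        unsigned max_cpus;        /* possible CPUs: -smp 4,maxcpus=8 -> 8  */
        DeviceState *icps[8];
    } DemoMachineState;

    /*
     * Pre-2.10 style: one presenter per *possible* CPU, created and
     * realized at machine init. Realizing a device whose class has a vmsd
     * registers one vmstate instance, so the outgoing stream carries
     * max_cpus 'icp/server'-like instances.
     */
    static void old_style_init(DemoMachineState *ms)
    {
        for (unsigned i = 0; i < ms->max_cpus; i++) {
            ms->icps[i] = qdev_new(TYPE_DEMO_ICP);
            qdev_realize(ms->icps[i], NULL, &error_fatal);
        }
    }

    /*
     * Post-5bc8d26de20c style: one presenter per *present* CPU, created
     * when the core is realized (or hot-plugged). Only smp_cpus instances
     * get registered, so an old source that sends max_cpus instances no
     * longer matches, independently of any QOM path change.
     */
    static void new_style_init(DemoMachineState *ms)
    {
        for (unsigned i = 0; i < ms->smp_cpus; i++) {
            ms->icps[i] = qdev_new(TYPE_DEMO_ICP);
            qdev_realize(ms->icps[i], NULL, &error_fatal);
        }
    }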
On 1/12/24 20:54, Fabiano Rosas wrote: > Cédric Le Goater <clg@kaod.org> writes: > >> On 1/10/24 14:19, Fabiano Rosas wrote: >>> Markus Armbruster <armbru@redhat.com> writes: >>> >>>> Peter Xu <peterx@redhat.com> writes: >>>> >>>>> On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote: >>>>>> Hi Fabiano, >>>>>> >>>>>> On 9/1/24 21:21, Fabiano Rosas wrote: >>>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>>> >>>>>>>> On 1/9/24 18:40, Fabiano Rosas wrote: >>>>>>>>> Cédric Le Goater <clg@kaod.org> writes: >>>>>>>>> >>>>>>>>>> On 1/3/24 20:53, Fabiano Rosas wrote: >>>>>>>>>>> Philippe Mathieu-Daudé <philmd@linaro.org> writes: >>>>>>>>>>> >>>>>>>>>>>> +Peter/Fabiano >>>>>>>>>>>> >>>>>>>>>>>> On 2/1/24 17:41, Cédric Le Goater wrote: >>>>>>>>>>>>> On 1/2/24 17:15, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>>> Hi Cédric, >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2/1/24 15:55, Cédric Le Goater wrote: >>>>>>>>>>>>>>> On 12/12/23 17:29, Philippe Mathieu-Daudé wrote: >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> When a MPCore cluster is used, the Cortex-A cores belong the the >>>>>>>>>>>>>>>> cluster container, not to the board/soc layer. This series move >>>>>>>>>>>>>>>> the creation of vCPUs to the MPCore private container. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Doing so we consolidate the QOM model, moving common code in a >>>>>>>>>>>>>>>> central place (abstract MPCore parent). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Changing the QOM hierarchy has an impact on the state of the machine >>>>>>>>>>>>>>> and some fixups are then required to maintain migration compatibility. >>>>>>>>>>>>>>> This can become a real headache for KVM machines like virt for which >>>>>>>>>>>>>>> migration compatibility is a feature, less for emulated ones. >>>>>>>>>>>>>> >>>>>>>>>>>>>> All changes are either moving properties (which are not migrated) >>>>>>>>>>>>>> or moving non-migrated QOM members (i.e. pointers of ARMCPU, which >>>>>>>>>>>>>> is still migrated elsewhere). So I don't see any obvious migration >>>>>>>>>>>>>> problem, but I might be missing something, so I Cc'ed Juan :> >>>>>>>>>>> >>>>>>>>>>> FWIW, I didn't spot anything problematic either. >>>>>>>>>>> >>>>>>>>>>> I've ran this through my migration compatibility series [1] and it >>>>>>>>>>> doesn't regress aarch64 migration from/to 8.2. The tests use '-M >>>>>>>>>>> virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't >>>>>>>>>>> think we even support migration of anything non-KVM on arm. >>>>>>>>>> >>>>>>>>>> it happens we do. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Oh, sorry, I didn't mean TCG here. Probably meant to say something like >>>>>>>>> non-KVM-capable cpus, as in 32-bit. Nevermind. >>>>>>>> >>>>>>>> Theoretically, we should be able to migrate to a TCG guest. Well, this >>>>>>>> worked in the past for PPC. When I was doing more KVM related changes, >>>>>>>> this was very useful for dev. Also, some machines are partially emulated. >>>>>>>> Anyhow I agree this is not a strong requirement and we often break it. >>>>>>>> Let's focus on KVM only. >>>>>>>> >>>>>>>>>>> 1- https://gitlab.com/farosas/qemu/-/jobs/5853599533 >>>>>>>>>> >>>>>>>>>> yes it depends on the QOM hierarchy and virt seems immune to the changes. >>>>>>>>>> Good. 
>>>>>>>>>>
>>>>>>>>>> However, changing the QOM topology clearly breaks migration compat,
>>>>>>>>>
>>>>>>>>> Well, "clearly" is relative =) You've mentioned pseries and aspeed
>>>>>>>>> already, do you have a pointer to one of those cases were we broke
>>>>>>>>> migration
>>>>>>>>
>>>>>>>> Regarding pseries, migration compat broke because of 5bc8d26de20c
>>>>>>>> ("spapr: allocate the ICPState object from under sPAPRCPUCore") which
>>>>>>>> is similar to the changes proposed by this series, it impacts the QOM
>>>>>>>> hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
>>>>>>>> ("spapr: fix migration of ICPState objects from/to older QEMU") which
>>>>>>>> is quite an headache and this turned out to raise another problem some
>>>>>>>> months ago ... :/ That's why I sent [1] to prepare removal of old
>>>>>>>> machines and workarounds becoming a burden.
>>>>>>>
>>>>>>> This feels like something that could be handled by the vmstate code
>>>>>>> somehow. The state is there, just under a different path.
>>>>>>
>>>>>> What, the QOM path is used in migration? ...
>>>>>
>>>>> Hopefully not..
>>>
>>> Unfortunately the original fix doesn't mention _what_ actually broke
>>> with migration. I assumed the QOM path was needed because otherwise I
>>> don't think the fix makes sense. The thread discussing that patch also
>>> directly mentions the QOM path:
>>>
>>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg450912.html
>>>
>>> But I probably misunderstood something while reading that thread.
>>>
>>>>>
>>>>>>
>>>>>> See recent discussions on "QOM path stability":
>>>>>> https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
>>>>>> https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
>>>>>> https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/
>>>>>
>>>>> If I read it right, the commit 46f7afa37096 example is pretty special that
>>>>> the QOM path more or less decided more than the hierachy itself but changes
>>>>> the existances of objects.
>>>>
>>>> Let's see whether I got this...
>>>>
>>>> We removed some useless objects, moved the useful ones to another home.
>>>> The move changed their QOM path.
>>>>
>>>> The problem was the removal of useless objects, because this also
>>>> removed their vmstate.
>>>
>>> If you checkout at the removal commit (5bc8d26de20c), the vmstate has
>>> been kept untouched.
>>>
>>>>
>>>> The fix was adding the vmstate back as a dummy.
>>>
>>> Since the vmstate was kept I don't see why would we need a dummy. The
>>> incoming migration stream would still have the state, only at a
>>> different point in the stream. It's surprising to me that that would
>>> cause an issue, but I'm not well versed in that code.
>>>
>>>>
>>>> The QOM patch changes are *not* part of the problem.
>>>
>>> The only explanation I can come up with is that after the patch
>>> migration has broken after a hotplug or similar operation. In such
>>> situation, the preallocated state would always be present before the
>>> patch, but sometimes not present after the patch in case, say, a
>>> hot-unplug has taken away a cpu + ICPState.
>>
>> May be. It has been a while.
>>
>> For a better understanding, I tried to reproduce. QEMU 2.9 still
>> compiles on a f38 (memfd_create needs a fix). However, a
>> QEMU-2.9/pseries-2.9 machine won't start any recent OS because
>> something broke the OS/platform negotiation process (CAS) ...
>
> Thanks for letting us know it still builds. That motivated me to try as
> well. Although it took me a while because of python's brokenness.
>
> The issue was indeed the difference in preallocation vs. no
> preallocation. I don't remember how to do hotplug in ppc anymore, but
> maxcpus was sufficient to reproduce:
>
> src 2.9.0:
> $ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -nographic -vga none
> -display none -accel tcg -machine pseries -smp 4,maxcpus=8 -m 256M
> -nodefaults
>
> dst 2.9.0 prior to 46f7afa:
> $ ./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio -display none
> -accel tcg -machine pseries-2.9 -smp 4,maxcpus=8 -m 256M -nodefaults
> -incoming defer
>
> (qemu) migrate_incoming tcp:localhost:1234
> (qemu) qemu-system-ppc64: Unknown savevm section or instance 'icp/server' 4
> qemu-system-ppc64: load of migration failed: Invalid argument
>
> As you can see, both use 4 cpus, but the src will try to migrate all 8
> icp/server instances. After the patch, the same migration works.
>
> So this was not related to the QOM path nor the QOM hierarchy; it was
> just about having state created at machine init vs. cpu realize time.

Ah! Of course, the machine is created in an init handler.

Thanks, Fabiano, for the diagnosis and for the time you took to
reproduce and analyze this. I had been assuming a hard constraint
between the QOM hierarchy and migration, and any change made me
uncomfortable. You cleared that up. Good that we are deprecating this
hack in a couple of versions.

> It might be wise to keep an eye on the QOM hierarchy anyway, as it could
> make these kinds of changes more evident during review.

Clearly.

Thanks,

C.

>> I removed the pre_2_10_has_unused_icps hack and worked around
>> another migration compat issue in IOMMU ... Here are the last
>> trace-events for migration :
>>
>> QEMU-mainline/pseries-2.9 -> QEMU-2.9/pseries-2.9
>>
>> 1016464@1705053752.106417:vmstate_load spapr, spapr
>> 1016464@1705053752.106419:vmstate_load_state spapr v3
>> 1016464@1705053752.106421:vmstate_load_state_field spapr:unused
>> 1016464@1705053752.106424:vmstate_load_state_field spapr:rtc_offset
>> 1016464@1705053752.106425:vmstate_load_state_field spapr:tb
>> 1016464@1705053752.106427:vmstate_n_elems tb: 1
>> 1016464@1705053752.106429:vmstate_load_state timebase v1
>> 1016464@1705053752.106432:vmstate_load_state_field timebase:guest_timebase
>> 1016464@1705053752.106433:vmstate_n_elems guest_timebase: 1
>> 1016464@1705053752.106435:vmstate_load_state_field timebase:time_of_the_day_ns
>> 1016464@1705053752.106436:vmstate_n_elems time_of_the_day_ns: 1
>> 1016464@1705053752.106438:vmstate_subsection_load timebase
>> 1016464@1705053752.106440:vmstate_subsection_load_bad timebase: spapr_option_vector_ov5_cas/(prefix)
>> 1016464@1705053752.106442:vmstate_load_state_end timebase end/0
>> 1016464@1705053752.106443:vmstate_subsection_load spapr
>> 1016464@1705053752.106445:vmstate_load_state spapr_option_vector_ov5_cas v1
>> 1016464@1705053752.106447:vmstate_load_state_field spapr_option_vector_ov5_cas:ov5_cas
>> 1016464@1705053752.106448:vmstate_n_elems ov5_cas: 1
>> 1016464@1705053752.106450:vmstate_load_state spapr_option_vector v1
>> 1016464@1705053752.106452:vmstate_load_state_field spapr_option_vector:bitmap
>> 1016464@1705053752.106454:vmstate_n_elems bitmap: 1
>> 1016464@1705053752.106456:vmstate_subsection_load spapr_option_vector
>> 1016464@1705053752.106457:vmstate_subsection_load_bad spapr_option_vector: (short)/
>> 1016464@1705053752.106459:vmstate_load_state_end spapr_option_vector end/0
>> 1016464@1705053752.106461:vmstate_subsection_load spapr_option_vector_ov5_cas
>> 1016464@1705053752.106462:vmstate_subsection_load_bad spapr_option_vector_ov5_cas: (short)/
>> 1016464@1705053752.106465:vmstate_load_state_end spapr_option_vector_ov5_cas end/0
>> 1016464@1705053752.106466:vmstate_subsection_load_bad spapr: spapr/cap/htm/(lookup)
>> qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr'
>
> This trace seems to be something else entirely. I'd be amazed if
> migrating mainline vs. something that old worked at all, even on x86.
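For context on the workaround being deprecated above: 46f7afa37096 keeps
the "unused" ICPState objects around for pre-2.10 pseries machine types
(the pre_2_10_has_unused_icps knob removed in the experiment), so that
both sides of a migration register the same set of 'icp/server'
instances. Below is a rough sketch of that idea in the same toy style as
earlier in the thread; it is not the real spapr code, only an
illustration of the compatibility knob.

#include <stdbool.h>
#include <stdio.h>

/* Toy model: how many 'icp/server' savevm sections get registered. */
static int icp_sections(int smp_cpus, int max_cpus,
                        bool pre_2_10_has_unused_icps)
{
    /* Old (pre-2.10) machine types keep one ICP per possible CPU, even
     * for CPUs that are not present, purely so the migration stream
     * layout matches what QEMU 2.9 produced. Newer machine types only
     * register the ICPs of CPUs that were actually realized. */
    return pre_2_10_has_unused_icps ? max_cpus : smp_cpus;
}

int main(void)
{
    printf("pseries-2.9 compat: %d sections\n", icp_sections(4, 8, true));
    printf("default:            %d sections\n", icp_sections(4, 8, false));
    return 0;
}

Removing the old machine types, as proposed in the thread, lets this
compatibility knob and the dummy objects it keeps alive go away with
them.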