Message ID | 20210329175348.26859-1-xry111@mengyan1223.wang |
---|---|
State | New |
Headers | show |
Series | drm/amdgpu: fix an underflow on non-4KB-page systems | expand |
On 2021-03-29 20:04 +0200, Christian König wrote: > Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao: > > If the initial value of `num_entires` (calculated at line 1654) is not > > an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a > > value greater than the initial value will be assigned to it. That causes > > `start > last + 1` after line 1708. Then in the next iteration an > > underflow happens at line 1654. It causes message > > > > *ERROR* Couldn't update BO_VA (-12) > > > > printed in kernel log, and GPU hanging. > > > > Fortify the criteria of the loop to fix this issue. > > NAK the value of num_entries must always be a multiple of > AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables. > > How do you trigger that? Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL) under Xorg, on MIPS64. See the BugLink. > Christian. > > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") > > Reported-by: Xi Ruoyao <xry111@mengyan1223.wang> > > Reported-by: Dan Horák <dan@danny.cz> > > Cc: stable@vger.kernel.org > > Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > index ad91c0c3c423..cee0cc9c8085 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct > > amdgpu_device *adev, > > } > > start = tmp; > > > > - } while (unlikely(start != last + 1)); > > + } while (unlikely(start < last + 1)); > > > > r = vm->update_funcs->commit(¶ms, fence); > > > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3 >
Am 29.03.21 um 20:08 schrieb Xi Ruoyao: > On 2021-03-29 20:04 +0200, Christian König wrote: >> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao: >>> If the initial value of `num_entires` (calculated at line 1654) is not >>> an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a >>> value greater than the initial value will be assigned to it. That causes >>> `start > last + 1` after line 1708. Then in the next iteration an >>> underflow happens at line 1654. It causes message >>> >>> *ERROR* Couldn't update BO_VA (-12) >>> >>> printed in kernel log, and GPU hanging. >>> >>> Fortify the criteria of the loop to fix this issue. >> NAK the value of num_entries must always be a multiple of >> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables. >> >> How do you trigger that? > Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL) > under Xorg, on MIPS64. See the BugLink. You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE. Christian. > >> Christian. >> >>> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 >>> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") >>> Reported-by: Xi Ruoyao <xry111@mengyan1223.wang> >>> Reported-by: Dan Horák <dan@danny.cz> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> index ad91c0c3c423..cee0cc9c8085 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct >>> amdgpu_device *adev, >>> } >>> start = tmp; >>> >>> - } while (unlikely(start != last + 1)); >>> + } while (unlikely(start < last + 1)); >>> >>> r = vm->update_funcs->commit(¶ms, fence); >>> >>> >>> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
On 2021-03-30 02:21 +0800, Xi Ruoyao wrote: > On 2021-03-29 20:10 +0200, Christian König wrote: > > You need to identify the root cause of this, most likely start or last > > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE. > > I printk'ed the value of start & last, they are all a multiple of 4 > (AMDGPU_GPU_PAGES_IN_CPU_PAGE). > > However... `num_entries = last - start + 1` so it became some irrational > thing... Either this line is wrong, or someone called > amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which > is unexpected. I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get: > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>] > amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>] > amdgpu_vm_bo_update+0x270/0x4c0 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>] > amdgpu_gem_va_ioctl+0x40c/0x430 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>] > drm_ioctl_kernel+0xcc/0x120 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>] > drm_ioctl+0x220/0x408 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>] > amdgpu_drm_ioctl+0x58/0x98 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8 > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>] > syscall_common+0x34/0x58 > > > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 > > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") > > > > > Reported-by: Xi Ruoyao <xry111@mengyan1223.wang> > > > > > Reported-by: Dan Horák <dan@danny.cz> > > > > > Cc: stable@vger.kernel.org > > > > > Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> > > > > > --- > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > index ad91c0c3c423..cee0cc9c8085 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct > > > > > amdgpu_device *adev, > > > > > } > > > > > start = tmp; > > > > > > > > > > - } while (unlikely(start != last + 1)); > > > > > + } while (unlikely(start < last + 1)); > > > > > > > > > > r = vm->update_funcs->commit(¶ms, fence); > > > > > > > > > > > > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3 > > >
Hi Christian, I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`: /* validate the parameters */ if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK || size == 0 || size & AMDGPU_GPU_PAGE_MASK) return -EINVAL; /* snip */ saddr /= AMDGPU_GPU_PAGE_SIZE; eaddr /= AMDGPU_GPU_PAGE_SIZE; /* snip */ mapping->start = saddr; mapping->last = eaddr; If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK". I tried it and it broke userspace: Xorg startup fails with EINVAL with this change. On 2021-03-30 02:30 +0800, Xi Ruoyao wrote: > On 2021-03-30 02:21 +0800, Xi Ruoyao wrote: > > On 2021-03-29 20:10 +0200, Christian König wrote: > > > You need to identify the root cause of this, most likely start or last > > > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE. > > > > I printk'ed the value of start & last, they are all a multiple of 4 > > (AMDGPU_GPU_PAGES_IN_CPU_PAGE). > > > > However... `num_entries = last - start + 1` so it became some irrational > > thing... Either this line is wrong, or someone called > > amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which > > is unexpected. > > I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get: > > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>] > > amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>] > > amdgpu_vm_bo_update+0x270/0x4c0 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>] > > amdgpu_gem_va_ioctl+0x40c/0x430 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>] > > drm_ioctl_kernel+0xcc/0x120 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>] > > drm_ioctl+0x220/0x408 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>] > > amdgpu_drm_ioctl+0x58/0x98 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8 > > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>] > > syscall_common+0x34/0x58 > > > > > > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 > > > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") > > > > > > Reported-by: Xi Ruoyao <xry111@mengyan1223.wang> > > > > > > Reported-by: Dan Horák <dan@danny.cz> > > > > > > Cc: stable@vger.kernel.org > > > > > > Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> > > > > > > --- > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > > index ad91c0c3c423..cee0cc9c8085 100644 > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct > > > > > > amdgpu_device *adev, > > > > > > } > > > > > > start = tmp; > > > > > > > > > > > > - } while (unlikely(start != last + 1)); > > > > > > + } while (unlikely(start < last + 1)); > > > > > > > > > > > > r = vm->update_funcs->commit(¶ms, fence); > > > > > > > > > > > > > > > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3 > > > > > >
On 2021-03-29 21:36 +0200, Christian König wrote: > Am 29.03.21 um 21:27 schrieb Xi Ruoyao: > > Hi Christian, > > > > I don't think there is any constraint implemented to ensure `num_entries % > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`: > > > > /* validate the parameters */ > > if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK > > || > > size == 0 || size & AMDGPU_GPU_PAGE_MASK) > > return -EINVAL; > > > > /* snip */ > > > > saddr /= AMDGPU_GPU_PAGE_SIZE; > > eaddr /= AMDGPU_GPU_PAGE_SIZE; > > > > /* snip */ > > > > mapping->start = saddr; > > mapping->last = eaddr; > > > > > > If we really want to ensure (mapping->last - mapping->start + 1) % > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace > > "AMDGPU_GPU_PAGE_MASK" > > in "validate the parameters" with "PAGE_MASK". > > Yeah, good point. > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with this > > change. > > Well in theory it is possible that we always fill the GPUVM on a 4k > basis while the native page size of the CPU is larger. Let me double > check the code. > > BTW: What code base are you based on? The code your post here is quite > outdated. Linus' tree. I'll go to sleep now (it's 03:39 here :( ), when I wake up I can try to fetch drm-next or something.
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: > On 2021-03-29 21:36 +0200, Christian König wrote: > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao: > > > Hi Christian, > > > > > > I don't think there is any constraint implemented to ensure `num_entries % > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`: > > > > > > /* validate the parameters */ > > > if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK > > > > > > > > size == 0 || size & AMDGPU_GPU_PAGE_MASK) > > > return -EINVAL; > > > > > > /* snip */ > > > > > > saddr /= AMDGPU_GPU_PAGE_SIZE; > > > eaddr /= AMDGPU_GPU_PAGE_SIZE; > > > > > > /* snip */ > > > > > > mapping->start = saddr; > > > mapping->last = eaddr; > > > > > > > > > If we really want to ensure (mapping->last - mapping->start + 1) % > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace > > > "AMDGPU_GPU_PAGE_MASK" > > > in "validate the parameters" with "PAGE_MASK". It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(. > > Yeah, good point. > > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with > > > this > > > change. > > > > Well in theory it is possible that we always fill the GPUVM on a 4k > > basis while the native page size of the CPU is larger. Let me double > > check the code. On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7. Another is in Mesa from userspace: it uses `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`. If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case. -- Xi Ruoyao <xry111@mengyan1223.wang> School of Aerospace Science and Technology, Xidian University
Am 30.03.21 um 14:04 schrieb Xi Ruoyao: > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: >> On 2021-03-29 21:36 +0200, Christian König wrote: >>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao: >>>> Hi Christian, >>>> >>>> I don't think there is any constraint implemented to ensure `num_entries % >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`: >>>> >>>> /* validate the parameters */ >>>> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK >>>> size == 0 || size & AMDGPU_GPU_PAGE_MASK) >>>> return -EINVAL; >>>> >>>> /* snip */ >>>> >>>> saddr /= AMDGPU_GPU_PAGE_SIZE; >>>> eaddr /= AMDGPU_GPU_PAGE_SIZE; >>>> >>>> /* snip */ >>>> >>>> mapping->start = saddr; >>>> mapping->last = eaddr; >>>> >>>> >>>> If we really want to ensure (mapping->last - mapping->start + 1) % >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace >>>> "AMDGPU_GPU_PAGE_MASK" >>>> in "validate the parameters" with "PAGE_MASK". > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of > "AMDGPU_GPU_PAGE_MASK" :(. > >>> Yeah, good point. >>> >>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with >>>> this >>>> change. >>> Well in theory it is possible that we always fill the GPUVM on a 4k >>> basis while the native page size of the CPU is larger. Let me double >>> check the code. > On my platform, there are two issues both causing the VM corruption. One is > fixed by agd5f/linux@fe001e7. What is fe001e7? A commit id? If yes then that is to short and I can't find it. > Another is in Mesa from userspace: it uses > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver > expects it to use `dev_info->virtual_address_alignment`. Mhm, looking at the kernel code I would rather say Mesa is correct and the kernel driver is broken. The gart_page_size is limited by the CPU page size, but the virtual_address_alignment isn't. > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue > and make virtual_address_alignment = 4K. Otherwise, we should fortify the > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the > userspace will just get an EINVAL, instead of a slient VM corruption. And > someone should tell Mesa developers to fix the code in this case. I rather see this as a kernel bug. Can you test if this code fragment fixes your issue: diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size; Thanks, Christian. > -- > Xi Ruoyao <xry111@mengyan1223.wang> > School of Aerospace Science and Technology, Xidian University >
On Tue, 30 Mar 2021 14:55:01 +0200 Christian König <christian.koenig@amd.com> wrote: > Am 30.03.21 um 14:04 schrieb Xi Ruoyao: > > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: > >> On 2021-03-29 21:36 +0200, Christian König wrote: > >>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao: > >>>> Hi Christian, > >>>> > >>>> I don't think there is any constraint implemented to ensure `num_entries % > >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`: > >>>> > >>>> /* validate the parameters */ > >>>> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK > >>>> size == 0 || size & AMDGPU_GPU_PAGE_MASK) > >>>> return -EINVAL; > >>>> > >>>> /* snip */ > >>>> > >>>> saddr /= AMDGPU_GPU_PAGE_SIZE; > >>>> eaddr /= AMDGPU_GPU_PAGE_SIZE; > >>>> > >>>> /* snip */ > >>>> > >>>> mapping->start = saddr; > >>>> mapping->last = eaddr; > >>>> > >>>> > >>>> If we really want to ensure (mapping->last - mapping->start + 1) % > >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace > >>>> "AMDGPU_GPU_PAGE_MASK" > >>>> in "validate the parameters" with "PAGE_MASK". > > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of > > "AMDGPU_GPU_PAGE_MASK" :(. > > > >>> Yeah, good point. > >>> > >>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with > >>>> this > >>>> change. > >>> Well in theory it is possible that we always fill the GPUVM on a 4k > >>> basis while the native page size of the CPU is larger. Let me double > >>> check the code. > > On my platform, there are two issues both causing the VM corruption. One is > > fixed by agd5f/linux@fe001e7. > > What is fe001e7? A commit id? If yes then that is to short and I can't > find it. it's a gitlab shortcut for https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be1fdc3abfc68b9ccc Dan > > > Another is in Mesa from userspace: it uses > > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver > > expects it to use `dev_info->virtual_address_alignment`. > > Mhm, looking at the kernel code I would rather say Mesa is correct and > the kernel driver is broken. > > The gart_page_size is limited by the CPU page size, but the > virtual_address_alignment isn't. > > > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue > > and make virtual_address_alignment = 4K. Otherwise, we should fortify the > > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the > > userspace will just get an EINVAL, instead of a slient VM corruption. And > > someone should tell Mesa developers to fix the code in this case. > > I rather see this as a kernel bug. Can you test if this code fragment > fixes your issue: > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > index 64beb3399604..e1260b517e1b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > *data, struct drm_file *filp) > } > dev_info->virtual_address_alignment = > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); > dev_info->pte_fragment_size = (1 << > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > + dev_info->gart_page_size = > dev_info->virtual_address_alignment; > dev_info->cu_active_number = adev->gfx.cu_info.number; > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; > dev_info->ce_ram_size = adev->gfx.ce_ram_size; > > > Thanks, > Christian. > > > -- > > Xi Ruoyao <xry111@mengyan1223.wang> > > School of Aerospace Science and Technology, Xidian University > > >
On 2021-03-30 14:55 +0200, Christian König wrote: > Am 30.03.21 um 14:04 schrieb Xi Ruoyao: > > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: > > > On 2021-03-29 21:36 +0200, Christian König wrote: > > > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao: > > > > > Hi Christian, > > > > > > > > > > I don't think there is any constraint implemented to ensure `num_entries > > > > > % > > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in > > > > > `amdgpu_vm_bo_map()`: > > > > > > > > > > /* validate the parameters */ > > > > > if (saddr & AMDGPU_GPU_PAGE_MASK || offset & > > > > > AMDGPU_GPU_PAGE_MASK > > > > > size == 0 || size & AMDGPU_GPU_PAGE_MASK) > > > > > return -EINVAL; > > > > > > > > > > /* snip */ > > > > > > > > > > saddr /= AMDGPU_GPU_PAGE_SIZE; > > > > > eaddr /= AMDGPU_GPU_PAGE_SIZE; > > > > > > > > > > /* snip */ > > > > > > > > > > mapping->start = saddr; > > > > > mapping->last = eaddr; > > > > > > > > > > > > > > > If we really want to ensure (mapping->last - mapping->start + 1) % > > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace > > > > > "AMDGPU_GPU_PAGE_MASK" > > > > > in "validate the parameters" with "PAGE_MASK". > > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of > > "AMDGPU_GPU_PAGE_MASK" :(. > > > > > > Yeah, good point. > > > > > > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with > > > > > this > > > > > change. > > > > Well in theory it is possible that we always fill the GPUVM on a 4k > > > > basis while the native page size of the CPU is larger. Let me double > > > > check the code. > > On my platform, there are two issues both causing the VM corruption. One is > > fixed by agd5f/linux@fe001e7. > > What is fe001e7? A commit id? If yes then that is to short and I can't > find it. > > > Another is in Mesa from userspace: it uses > > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver > > expects it to use `dev_info->virtual_address_alignment`. > > Mhm, looking at the kernel code I would rather say Mesa is correct and > the kernel driver is broken. > > The gart_page_size is limited by the CPU page size, but the > virtual_address_alignment isn't. > > > If we can make the change to fill the GPUVM on a 4k basis, we can fix this > > issue > > and make virtual_address_alignment = 4K. Otherwise, we should fortify the > > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then > > the > > userspace will just get an EINVAL, instead of a slient VM corruption. And > > someone should tell Mesa developers to fix the code in this case. > > I rather see this as a kernel bug. Can you test if this code fragment > fixes your issue: > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > index 64beb3399604..e1260b517e1b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > *data, struct drm_file *filp) > } > dev_info->virtual_address_alignment = > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); > dev_info->pte_fragment_size = (1 << > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > + dev_info->gart_page_size = > dev_info->virtual_address_alignment; > dev_info->cu_active_number = adev->gfx.cu_info.number; > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; > dev_info->ce_ram_size = adev->gfx.ce_ram_size; It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191 before (with a common sub-expression, though :). -- Xi Ruoyao <xry111@mengyan1223.wang> School of Aerospace Science and Technology, Xidian University
Am 30.03.21 um 15:00 schrieb Dan Horák: > On Tue, 30 Mar 2021 14:55:01 +0200 > Christian König <christian.koenig@amd.com> wrote: > >> Am 30.03.21 um 14:04 schrieb Xi Ruoyao: >>> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: >>>> On 2021-03-29 21:36 +0200, Christian König wrote: >>>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao: >>>>>> Hi Christian, >>>>>> >>>>>> I don't think there is any constraint implemented to ensure `num_entries % >>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`: >>>>>> >>>>>> /* validate the parameters */ >>>>>> if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK >>>>>> size == 0 || size & AMDGPU_GPU_PAGE_MASK) >>>>>> return -EINVAL; >>>>>> >>>>>> /* snip */ >>>>>> >>>>>> saddr /= AMDGPU_GPU_PAGE_SIZE; >>>>>> eaddr /= AMDGPU_GPU_PAGE_SIZE; >>>>>> >>>>>> /* snip */ >>>>>> >>>>>> mapping->start = saddr; >>>>>> mapping->last = eaddr; >>>>>> >>>>>> >>>>>> If we really want to ensure (mapping->last - mapping->start + 1) % >>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace >>>>>> "AMDGPU_GPU_PAGE_MASK" >>>>>> in "validate the parameters" with "PAGE_MASK". >>> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of >>> "AMDGPU_GPU_PAGE_MASK" :(. >>> >>>>> Yeah, good point. >>>>> >>>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with >>>>>> this >>>>>> change. >>>>> Well in theory it is possible that we always fill the GPUVM on a 4k >>>>> basis while the native page size of the CPU is larger. Let me double >>>>> check the code. >>> On my platform, there are two issues both causing the VM corruption. One is >>> fixed by agd5f/linux@fe001e7. >> What is fe001e7? A commit id? If yes then that is to short and I can't >> find it. > it's a gitlab shortcut for > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux%2F-%2Fcommit%2Ffe001e70a55d0378328612be1fdc3abfc68b9ccc&data=04%7C01%7Cchristian.koenig%40amd.com%7Cd16d123aaa01420ebd0808d8f37bbf2f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527060812278536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=5rFVLxSRnfHUGjhoiqx1e6SeROqbg4ZPef%2BxEvgv%2BTg%3D&reserved=0 Ah! Yes I would expect that this patch is fixing something in this use case. Thanks, Christian. > > > Dan >>> Another is in Mesa from userspace: it uses >>> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver >>> expects it to use `dev_info->virtual_address_alignment`. >> Mhm, looking at the kernel code I would rather say Mesa is correct and >> the kernel driver is broken. >> >> The gart_page_size is limited by the CPU page size, but the >> virtual_address_alignment isn't. >> >>> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue >>> and make virtual_address_alignment = 4K. Otherwise, we should fortify the >>> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the >>> userspace will just get an EINVAL, instead of a slient VM corruption. And >>> someone should tell Mesa developers to fix the code in this case. >> I rather see this as a kernel bug. Can you test if this code fragment >> fixes your issue: >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> index 64beb3399604..e1260b517e1b 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void >> *data, struct drm_file *filp) >> } >> dev_info->virtual_address_alignment = >> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); >> dev_info->pte_fragment_size = (1 << >> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; >> - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; >> + dev_info->gart_page_size = >> dev_info->virtual_address_alignment; >> dev_info->cu_active_number = adev->gfx.cu_info.number; >> dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; >> dev_info->ce_ram_size = adev->gfx.ce_ram_size; >> >> >> Thanks, >> Christian. >> >>> -- >>> Xi Ruoyao <xry111@mengyan1223.wang> >>> School of Aerospace Science and Technology, Xidian University >>>
On 2021-03-30 21:02 +0800, Xi Ruoyao wrote: > On 2021-03-30 14:55 +0200, Christian König wrote: > > > > I rather see this as a kernel bug. Can you test if this code fragment > > fixes your issue: > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > index 64beb3399604..e1260b517e1b 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > > *data, struct drm_file *filp) > > } > > dev_info->virtual_address_alignment = > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); > > dev_info->pte_fragment_size = (1 << > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > > + dev_info->gart_page_size = > > dev_info->virtual_address_alignment; > > dev_info->cu_active_number = adev->gfx.cu_info.number; > > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; > > dev_info->ce_ram_size = adev->gfx.ce_ram_size; > > It works. I've seen it at > https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191 > before (with a common sub-expression, though :). Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs without this commit. But on the system I built from source, I didn't see any issue before Linux 5.11. So I misbelieved that it was something already fixed. Dan: you can try it on your PPC 64 with non-4K page as well. -- Xi Ruoyao <xry111@mengyan1223.wang> School of Aerospace Science and Technology, Xidian University
On Tue, 30 Mar 2021 21:09:12 +0800 Xi Ruoyao <xry111@mengyan1223.wang> wrote: > On 2021-03-30 21:02 +0800, Xi Ruoyao wrote: > > On 2021-03-30 14:55 +0200, Christian König wrote: > > > > > > I rather see this as a kernel bug. Can you test if this code fragment > > > fixes your issue: > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > index 64beb3399604..e1260b517e1b 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > > > *data, struct drm_file *filp) > > > } > > > dev_info->virtual_address_alignment = > > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); > > > dev_info->pte_fragment_size = (1 << > > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > > > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > > > + dev_info->gart_page_size = > > > dev_info->virtual_address_alignment; > > > dev_info->cu_active_number = adev->gfx.cu_info.number; > > > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; > > > dev_info->ce_ram_size = adev->gfx.ce_ram_size; > > > > It works. I've seen it at > > https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191 > > before (with a common sub-expression, though :). > > Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs > without this commit. But on the system I built from source, I didn't see any > issue before Linux 5.11. So I misbelieved that it was something already fixed. > > Dan: you can try it on your PPC 64 with non-4K page as well. yup, looks good here as well, ppc64le (Power9) system with 64KB pages Dan
Am 30.03.21 um 15:23 schrieb Dan Horák: > On Tue, 30 Mar 2021 21:09:12 +0800 > Xi Ruoyao <xry111@mengyan1223.wang> wrote: > >> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote: >>> On 2021-03-30 14:55 +0200, Christian König wrote: >>>> I rather see this as a kernel bug. Can you test if this code fragment >>>> fixes your issue: >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> index 64beb3399604..e1260b517e1b 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c >>>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void >>>> *data, struct drm_file *filp) >>>> } >>>> dev_info->virtual_address_alignment = >>>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); >>>> dev_info->pte_fragment_size = (1 << >>>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; >>>> - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; >>>> + dev_info->gart_page_size = >>>> dev_info->virtual_address_alignment; >>>> dev_info->cu_active_number = adev->gfx.cu_info.number; >>>> dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; >>>> dev_info->ce_ram_size = adev->gfx.ce_ram_size; >>> It works. I've seen it at >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&reserved=0 >>> before (with a common sub-expression, though :). >> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs >> without this commit. But on the system I built from source, I didn't see any >> issue before Linux 5.11. So I misbelieved that it was something already fixed. >> >> Dan: you can try it on your PPC 64 with non-4K page as well. > yup, looks good here as well, ppc64le (Power9) system with 64KB pages Mhm, as far as I can say this patch never made it to us. Can you please send it to the mailing list with me on CC? Thanks, Christian. > > > Dan
On 2021-03-30 15:24 +0200, Christian König wrote: > Am 30.03.21 um 15:23 schrieb Dan Horák: > > On Tue, 30 Mar 2021 21:09:12 +0800 > > Xi Ruoyao <xry111@mengyan1223.wang> wrote: > > > > > On 2021-03-30 21:02 +0800, Xi Ruoyao wrote: > > > > On 2021-03-30 14:55 +0200, Christian König wrote: > > > > > I rather see this as a kernel bug. Can you test if this code fragment > > > > > fixes your issue: > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > > > index 64beb3399604..e1260b517e1b 100644 > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > > > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > > > > > *data, struct drm_file *filp) > > > > > } > > > > > dev_info->virtual_address_alignment = > > > > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); > > > > > dev_info->pte_fragment_size = (1 << > > > > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > > > > > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > > > > > + dev_info->gart_page_size = > > > > > dev_info->virtual_address_alignment; > > > > > dev_info->cu_active_number = adev- > > > > > >gfx.cu_info.number; > > > > > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; > > > > > dev_info->ce_ram_size = adev->gfx.ce_ram_size; > > > > It works. I've seen it at > > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&reserved=0 > > > > before (with a common sub-expression, though :). > > > Some comment: on an old version of Fedora ported by Loongson, Xorg just > > > hangs > > > without this commit. But on the system I built from source, I didn't see > > > any > > > issue before Linux 5.11. So I misbelieved that it was something already > > > fixed. > > > > > > Dan: you can try it on your PPC 64 with non-4K page as well. > > yup, looks good here as well, ppc64le (Power9) system with 64KB pages > > Mhm, as far as I can say this patch never made it to us. I think maybe its author considers it a "workaround" like me, as there is already a "virtual_address_alignment" which seems correct. > Can you please send it to the mailing list with me on CC? I've sent it, together with my patch for using ~PAGE_MASK in parameter validation.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp; - } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1)); r = vm->update_funcs->commit(¶ms, fence);
If the initial value of `num_entires` (calculated at line 1654) is not an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a value greater than the initial value will be assigned to it. That causes `start > last + 1` after line 1708. Then in the next iteration an underflow happens at line 1654. It causes message *ERROR* Couldn't update BO_VA (-12) printed in kernel log, and GPU hanging. Fortify the criteria of the loop to fix this issue. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao <xry111@mengyan1223.wang> Reported-by: Dan Horák <dan@danny.cz> Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang> --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3