diff mbox series

drm/amdgpu: fix an underflow on non-4KB-page systems

Message ID 20210329175348.26859-1-xry111@mengyan1223.wang
State New
Headers show
Series drm/amdgpu: fix an underflow on non-4KB-page systems | expand

Commit Message

Xi Ruoyao March 29, 2021, 5:53 p.m. UTC
If the initial value of `num_entires` (calculated at line 1654) is not
an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
value greater than the initial value will be assigned to it.  That causes
`start > last + 1` after line 1708.  Then in the next iteration an
underflow happens at line 1654.  It causes message

    *ERROR* Couldn't update BO_VA (-12)

printed in kernel log, and GPU hanging.

Fortify the criteria of the loop to fix this issue.

BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
Reported-by: Dan Horák <dan@danny.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3

Comments

Xi Ruoyao March 29, 2021, 6:08 p.m. UTC | #1
On 2021-03-29 20:04 +0200, Christian König wrote:
> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
> > If the initial value of `num_entires` (calculated at line 1654) is not
> > an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
> > value greater than the initial value will be assigned to it.  That causes
> > `start > last + 1` after line 1708.  Then in the next iteration an
> > underflow happens at line 1654.  It causes message
> > 
> >      *ERROR* Couldn't update BO_VA (-12)
> > 
> > printed in kernel log, and GPU hanging.
> > 
> > Fortify the criteria of the loop to fix this issue.
> 
> NAK the value of num_entries must always be a multiple of 
> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
> 
> How do you trigger that?

Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL)
under Xorg, on MIPS64.  See the BugLink.

> Christian.
> 
> > 
> > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
> > Reported-by: Dan Horák <dan@danny.cz>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index ad91c0c3c423..cee0cc9c8085 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > amdgpu_device *adev,
> >                 }
> >                 start = tmp;
> >   
> > -       } while (unlikely(start != last + 1));
> > +       } while (unlikely(start < last + 1));
> >   
> >         r = vm->update_funcs->commit(&params, fence);
> >   
> > 
> > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
>
Christian König March 29, 2021, 6:10 p.m. UTC | #2
Am 29.03.21 um 20:08 schrieb Xi Ruoyao:
> On 2021-03-29 20:04 +0200, Christian König wrote:
>> Am 29.03.21 um 19:53 schrieb Xℹ Ruoyao:
>>> If the initial value of `num_entires` (calculated at line 1654) is not
>>> an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a
>>> value greater than the initial value will be assigned to it.  That causes
>>> `start > last + 1` after line 1708.  Then in the next iteration an
>>> underflow happens at line 1654.  It causes message
>>>
>>>       *ERROR* Couldn't update BO_VA (-12)
>>>
>>> printed in kernel log, and GPU hanging.
>>>
>>> Fortify the criteria of the loop to fix this issue.
>> NAK the value of num_entries must always be a multiple of
>> AMDGPU_GPU_PAGES_IN_CPU_PAGE or otherwise we corrupt the page tables.
>>
>> How do you trigger that?
> Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL)
> under Xorg, on MIPS64.  See the BugLink.

You need to identify the root cause of this, most likely start or last 
are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.

Christian.

>
>> Christian.
>>
>>> BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
>>> Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
>>> Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
>>> Reported-by: Dan Horák <dan@danny.cz>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index ad91c0c3c423..cee0cc9c8085 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
>>> amdgpu_device *adev,
>>>                  }
>>>                  start = tmp;
>>>    
>>> -       } while (unlikely(start != last + 1));
>>> +       } while (unlikely(start < last + 1));
>>>    
>>>          r = vm->update_funcs->commit(&params, fence);
>>>    
>>>
>>> base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
Xi Ruoyao March 29, 2021, 6:30 p.m. UTC | #3
On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
> On 2021-03-29 20:10 +0200, Christian König wrote:
> > You need to identify the root cause of this, most likely start or last 
> > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
> 
> I printk'ed the value of start & last, they are all a multiple of 4
> (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
> 
> However... `num_entries = last - start + 1` so it became some irrational
> thing...  Either this line is wrong, or someone called
> amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
> is unexpected.

I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:

> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
> amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
> amdgpu_vm_bo_update+0x270/0x4c0
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
> amdgpu_gem_va_ioctl+0x40c/0x430
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
> drm_ioctl_kernel+0xcc/0x120
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
> drm_ioctl+0x220/0x408
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
> amdgpu_drm_ioctl+0x58/0x98
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
> Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
> syscall_common+0x34/0x58
> 

> > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > > Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
> > > > > Reported-by: Dan Horák <dan@danny.cz>
> > > > > Cc: stable@vger.kernel.org
> > > > > Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
> > > > > ---
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > > amdgpu_device *adev,
> > > > >                  }
> > > > >                  start = tmp;
> > > > >    
> > > > > -       } while (unlikely(start != last + 1));
> > > > > +       } while (unlikely(start < last + 1));
> > > > >    
> > > > >          r = vm->update_funcs->commit(&params, fence);
> > > > >    
> > > > > 
> > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> > 
>
Xi Ruoyao March 29, 2021, 7:27 p.m. UTC | #4
Hi Christian,

I don't think there is any constraint implemented to ensure `num_entries %
AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:

        /* validate the parameters */
        if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
            size == 0 || size & AMDGPU_GPU_PAGE_MASK)
                return -EINVAL;

/* snip */

        saddr /= AMDGPU_GPU_PAGE_SIZE;
        eaddr /= AMDGPU_GPU_PAGE_SIZE;

/* snip */

        mapping->start = saddr;
        mapping->last = eaddr;


If we really want to ensure (mapping->last - mapping->start + 1) %
AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK"
in "validate the parameters" with "PAGE_MASK".

I tried it and it broke userspace: Xorg startup fails with EINVAL with this
change.

On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:
> On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
> > On 2021-03-29 20:10 +0200, Christian König wrote:
> > > You need to identify the root cause of this, most likely start or last 
> > > are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
> > 
> > I printk'ed the value of start & last, they are all a multiple of 4
> > (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
> > 
> > However... `num_entries = last - start + 1` so it became some irrational
> > thing...  Either this line is wrong, or someone called
> > amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which
> > is unexpected.
> 
> I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
> 
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c90750>]
> > amdgpu_vm_bo_update_mapping.constprop.0+0x218/0xae8
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c922b8>]
> > amdgpu_vm_bo_update+0x270/0x4c0
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c823ec>]
> > amdgpu_gem_va_ioctl+0x40c/0x430
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cae4>]
> > drm_ioctl_kernel+0xcc/0x120
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c1cfd8>]
> > drm_ioctl+0x220/0x408
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff83c5ea48>]
> > amdgpu_drm_ioctl+0x58/0x98
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff838feac4>] sys_ioctl+0xcc/0xe8
> > Mar 30 02:28:27 xry111-A1901 kernel: [<ffffffff836e74f0>]
> > syscall_common+0x34/0x58
> > 
> 
> > > > > > BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549
> > > > > > Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2")
> > > > > > Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
> > > > > > Reported-by: Dan Horák <dan@danny.cz>
> > > > > > Cc: stable@vger.kernel.org
> > > > > > Signed-off-by: Xi Ruoyao <xry111@mengyan1223.wang>
> > > > > > ---
> > > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
> > > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > index ad91c0c3c423..cee0cc9c8085 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > > > > > @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct
> > > > > > amdgpu_device *adev,
> > > > > >                  }
> > > > > >                  start = tmp;
> > > > > >    
> > > > > > -       } while (unlikely(start != last + 1));
> > > > > > +       } while (unlikely(start < last + 1));
> > > > > >    
> > > > > >          r = vm->update_funcs->commit(&params, fence);
> > > > > >    
> > > > > > 
> > > > > > base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
> > > 
> > 
>
Xi Ruoyao March 29, 2021, 7:40 p.m. UTC | #5
On 2021-03-29 21:36 +0200, Christian König wrote:
> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> > Hi Christian,
> > 
> > I don't think there is any constraint implemented to ensure `num_entries %
> > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:
> > 
> >          /* validate the parameters */
> >          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
> > ||
> >              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> >                  return -EINVAL;
> > 
> > /* snip */
> > 
> >          saddr /= AMDGPU_GPU_PAGE_SIZE;
> >          eaddr /= AMDGPU_GPU_PAGE_SIZE;
> > 
> > /* snip */
> > 
> >          mapping->start = saddr;
> >          mapping->last = eaddr;
> > 
> > 
> > If we really want to ensure (mapping->last - mapping->start + 1) %
> > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > "AMDGPU_GPU_PAGE_MASK"
> > in "validate the parameters" with "PAGE_MASK".
> 
> Yeah, good point.
> 
> > I tried it and it broke userspace: Xorg startup fails with EINVAL with this
> > change.
> 
> Well in theory it is possible that we always fill the GPUVM on a 4k 
> basis while the native page size of the CPU is larger. Let me double 
> check the code.
> 
> BTW: What code base are you based on? The code your post here is quite 
> outdated.

Linus' tree.

I'll go to sleep now (it's 03:39 here :( ), when I wake up I can try to fetch
drm-next or something.
Xi Ruoyao March 30, 2021, 12:04 p.m. UTC | #6
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> On 2021-03-29 21:36 +0200, Christian König wrote:

> > Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

> > > Hi Christian,

> > > 

> > > I don't think there is any constraint implemented to ensure `num_entries %

> > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:

> > > 

> > >          /* validate the parameters */

> > >          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK

> > > > > 

> > >              size == 0 || size & AMDGPU_GPU_PAGE_MASK)

> > >                  return -EINVAL;

> > > 

> > > /* snip */

> > > 

> > >          saddr /= AMDGPU_GPU_PAGE_SIZE;

> > >          eaddr /= AMDGPU_GPU_PAGE_SIZE;

> > > 

> > > /* snip */

> > > 

> > >          mapping->start = saddr;

> > >          mapping->last = eaddr;

> > > 

> > > 

> > > If we really want to ensure (mapping->last - mapping->start + 1) %

> > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace

> > > "AMDGPU_GPU_PAGE_MASK"

> > > in "validate the parameters" with "PAGE_MASK".


It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
"AMDGPU_GPU_PAGE_MASK" :(.

> > Yeah, good point.

> > 

> > > I tried it and it broke userspace: Xorg startup fails with EINVAL with

> > > this

> > > change.

> > 

> > Well in theory it is possible that we always fill the GPUVM on a 4k 

> > basis while the native page size of the CPU is larger. Let me double 

> > check the code.


On my platform, there are two issues both causing the VM corruption.  One is
fixed by agd5f/linux@fe001e7.  Another is in Mesa from userspace:  it uses
`dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
expects it to use `dev_info->virtual_address_alignment`.

If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
and make virtual_address_alignment = 4K.  Otherwise, we should fortify the
parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then the
userspace will just get an EINVAL, instead of a slient VM corruption.  And
someone should tell Mesa developers to fix the code in this case.
--
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University
Christian König March 30, 2021, 12:55 p.m. UTC | #7
Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>> Hi Christian,
>>>>
>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:
>>>>
>>>>           /* validate the parameters */
>>>>           if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>               size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>                   return -EINVAL;
>>>>
>>>> /* snip */
>>>>
>>>>           saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>           eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>
>>>> /* snip */
>>>>
>>>>           mapping->start = saddr;
>>>>           mapping->last = eaddr;
>>>>
>>>>
>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>> "AMDGPU_GPU_PAGE_MASK"
>>>> in "validate the parameters" with "PAGE_MASK".
> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> "AMDGPU_GPU_PAGE_MASK" :(.
>
>>> Yeah, good point.
>>>
>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>> this
>>>> change.
>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>> basis while the native page size of the CPU is larger. Let me double
>>> check the code.
> On my platform, there are two issues both causing the VM corruption.  One is
> fixed by agd5f/linux@fe001e7.

What is fe001e7? A commit id? If yes then that is to short and I can't 
find it.

>    Another is in Mesa from userspace:  it uses
> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> expects it to use `dev_info->virtual_address_alignment`.

Mhm, looking at the kernel code I would rather say Mesa is correct and 
the kernel driver is broken.

The gart_page_size is limited by the CPU page size, but the 
virtual_address_alignment isn't.

> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> and make virtual_address_alignment = 4K.  Otherwise, we should fortify the
> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then the
> userspace will just get an EINVAL, instead of a slient VM corruption.  And
> someone should tell Mesa developers to fix the code in this case.

I rather see this as a kernel bug. Can you test if this code fragment 
fixes your issue:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 64beb3399604..e1260b517e1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 
*data, struct drm_file *filp)
                 }
                 dev_info->virtual_address_alignment = 
max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
                 dev_info->pte_fragment_size = (1 << 
adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
-               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
+               dev_info->gart_page_size = 
dev_info->virtual_address_alignment;
                 dev_info->cu_active_number = adev->gfx.cu_info.number;
                 dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
                 dev_info->ce_ram_size = adev->gfx.ce_ram_size;


Thanks,
Christian.

> --
> Xi Ruoyao <xry111@mengyan1223.wang>
> School of Aerospace Science and Technology, Xidian University
>
Dan Horák March 30, 2021, 1 p.m. UTC | #8
On Tue, 30 Mar 2021 14:55:01 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:

> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

> >> On 2021-03-29 21:36 +0200, Christian König wrote:

> >>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

> >>>> Hi Christian,

> >>>>

> >>>> I don't think there is any constraint implemented to ensure `num_entries %

> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:

> >>>>

> >>>>           /* validate the parameters */

> >>>>           if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK

> >>>>               size == 0 || size & AMDGPU_GPU_PAGE_MASK)

> >>>>                   return -EINVAL;

> >>>>

> >>>> /* snip */

> >>>>

> >>>>           saddr /= AMDGPU_GPU_PAGE_SIZE;

> >>>>           eaddr /= AMDGPU_GPU_PAGE_SIZE;

> >>>>

> >>>> /* snip */

> >>>>

> >>>>           mapping->start = saddr;

> >>>>           mapping->last = eaddr;

> >>>>

> >>>>

> >>>> If we really want to ensure (mapping->last - mapping->start + 1) %

> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace

> >>>> "AMDGPU_GPU_PAGE_MASK"

> >>>> in "validate the parameters" with "PAGE_MASK".

> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of

> > "AMDGPU_GPU_PAGE_MASK" :(.

> >

> >>> Yeah, good point.

> >>>

> >>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with

> >>>> this

> >>>> change.

> >>> Well in theory it is possible that we always fill the GPUVM on a 4k

> >>> basis while the native page size of the CPU is larger. Let me double

> >>> check the code.

> > On my platform, there are two issues both causing the VM corruption.  One is

> > fixed by agd5f/linux@fe001e7.

> 

> What is fe001e7? A commit id? If yes then that is to short and I can't 

> find it.


it's a gitlab shortcut for
https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be1fdc3abfc68b9ccc


		Dan
> 

> >    Another is in Mesa from userspace:  it uses

> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver

> > expects it to use `dev_info->virtual_address_alignment`.

> 

> Mhm, looking at the kernel code I would rather say Mesa is correct and 

> the kernel driver is broken.

> 

> The gart_page_size is limited by the CPU page size, but the 

> virtual_address_alignment isn't.

> 

> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue

> > and make virtual_address_alignment = 4K.  Otherwise, we should fortify the

> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then the

> > userspace will just get an EINVAL, instead of a slient VM corruption.  And

> > someone should tell Mesa developers to fix the code in this case.

> 

> I rather see this as a kernel bug. Can you test if this code fragment 

> fixes your issue:

> 

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 

> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> index 64beb3399604..e1260b517e1b 100644

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 

> *data, struct drm_file *filp)

>                  }

>                  dev_info->virtual_address_alignment = 

> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);

>                  dev_info->pte_fragment_size = (1 << 

> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;

> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;

> +               dev_info->gart_page_size = 

> dev_info->virtual_address_alignment;

>                  dev_info->cu_active_number = adev->gfx.cu_info.number;

>                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;

>                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;

> 

> 

> Thanks,

> Christian.

> 

> > --

> > Xi Ruoyao <xry111@mengyan1223.wang>

> > School of Aerospace Science and Technology, Xidian University

> >

>
Xi Ruoyao March 30, 2021, 1:02 p.m. UTC | #9
On 2021-03-30 14:55 +0200, Christian König wrote:
> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:

> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:

> > > On 2021-03-29 21:36 +0200, Christian König wrote:

> > > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao:

> > > > > Hi Christian,

> > > > > 

> > > > > I don't think there is any constraint implemented to ensure `num_entries

> > > > > %

> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in

> > > > > `amdgpu_vm_bo_map()`:

> > > > > 

> > > > >           /* validate the parameters */

> > > > >           if (saddr & AMDGPU_GPU_PAGE_MASK || offset &

> > > > > AMDGPU_GPU_PAGE_MASK

> > > > >               size == 0 || size & AMDGPU_GPU_PAGE_MASK)

> > > > >                   return -EINVAL;

> > > > > 

> > > > > /* snip */

> > > > > 

> > > > >           saddr /= AMDGPU_GPU_PAGE_SIZE;

> > > > >           eaddr /= AMDGPU_GPU_PAGE_SIZE;

> > > > > 

> > > > > /* snip */

> > > > > 

> > > > >           mapping->start = saddr;

> > > > >           mapping->last = eaddr;

> > > > > 

> > > > > 

> > > > > If we really want to ensure (mapping->last - mapping->start + 1) %

> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace

> > > > > "AMDGPU_GPU_PAGE_MASK"

> > > > > in "validate the parameters" with "PAGE_MASK".

> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of

> > "AMDGPU_GPU_PAGE_MASK" :(.

> > 

> > > > Yeah, good point.

> > > > 

> > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with

> > > > > this

> > > > > change.

> > > > Well in theory it is possible that we always fill the GPUVM on a 4k

> > > > basis while the native page size of the CPU is larger. Let me double

> > > > check the code.

> > On my platform, there are two issues both causing the VM corruption.  One is

> > fixed by agd5f/linux@fe001e7.

> 

> What is fe001e7? A commit id? If yes then that is to short and I can't 

> find it.

> 

> >    Another is in Mesa from userspace:  it uses

> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver

> > expects it to use `dev_info->virtual_address_alignment`.

> 

> Mhm, looking at the kernel code I would rather say Mesa is correct and 

> the kernel driver is broken.

> 

> The gart_page_size is limited by the CPU page size, but the 

> virtual_address_alignment isn't.

> 

> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this

> > issue

> > and make virtual_address_alignment = 4K.  Otherwise, we should fortify the

> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then

> > the

> > userspace will just get an EINVAL, instead of a slient VM corruption.  And

> > someone should tell Mesa developers to fix the code in this case.

> 

> I rather see this as a kernel bug. Can you test if this code fragment 

> fixes your issue:

> 

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 

> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> index 64beb3399604..e1260b517e1b 100644

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 

> *data, struct drm_file *filp)

>                  }

>                  dev_info->virtual_address_alignment = 

> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);

>                  dev_info->pte_fragment_size = (1 << 

> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;

> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;

> +               dev_info->gart_page_size = 

> dev_info->virtual_address_alignment;

>                  dev_info->cu_active_number = adev->gfx.cu_info.number;

>                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;

>                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;


It works.  I've seen it at
https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191
before (with a common sub-expression, though :).

-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University
Christian König March 30, 2021, 1:02 p.m. UTC | #10
Am 30.03.21 um 15:00 schrieb Dan Horák:
> On Tue, 30 Mar 2021 14:55:01 +0200
> Christian König <christian.koenig@amd.com> wrote:
>
>> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
>>> On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
>>>> On 2021-03-29 21:36 +0200, Christian König wrote:
>>>>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
>>>>>> Hi Christian,
>>>>>>
>>>>>> I don't think there is any constraint implemented to ensure `num_entries %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in `amdgpu_vm_bo_map()`:
>>>>>>
>>>>>>            /* validate the parameters */
>>>>>>            if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK
>>>>>>                size == 0 || size & AMDGPU_GPU_PAGE_MASK)
>>>>>>                    return -EINVAL;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>>            saddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>            eaddr /= AMDGPU_GPU_PAGE_SIZE;
>>>>>>
>>>>>> /* snip */
>>>>>>
>>>>>>            mapping->start = saddr;
>>>>>>            mapping->last = eaddr;
>>>>>>
>>>>>>
>>>>>> If we really want to ensure (mapping->last - mapping->start + 1) %
>>>>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
>>>>>> "AMDGPU_GPU_PAGE_MASK"
>>>>>> in "validate the parameters" with "PAGE_MASK".
>>> It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
>>> "AMDGPU_GPU_PAGE_MASK" :(.
>>>
>>>>> Yeah, good point.
>>>>>
>>>>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
>>>>>> this
>>>>>> change.
>>>>> Well in theory it is possible that we always fill the GPUVM on a 4k
>>>>> basis while the native page size of the CPU is larger. Let me double
>>>>> check the code.
>>> On my platform, there are two issues both causing the VM corruption.  One is
>>> fixed by agd5f/linux@fe001e7.
>> What is fe001e7? A commit id? If yes then that is to short and I can't
>> find it.
> it's a gitlab shortcut for
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux%2F-%2Fcommit%2Ffe001e70a55d0378328612be1fdc3abfc68b9ccc&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cd16d123aaa01420ebd0808d8f37bbf2f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527060812278536%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=5rFVLxSRnfHUGjhoiqx1e6SeROqbg4ZPef%2BxEvgv%2BTg%3D&amp;reserved=0

Ah! Yes I would expect that this patch is fixing something in this use case.

Thanks,
Christian.

>
>
> 		Dan
>>>     Another is in Mesa from userspace:  it uses
>>> `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
>>> expects it to use `dev_info->virtual_address_alignment`.
>> Mhm, looking at the kernel code I would rather say Mesa is correct and
>> the kernel driver is broken.
>>
>> The gart_page_size is limited by the CPU page size, but the
>> virtual_address_alignment isn't.
>>
>>> If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
>>> and make virtual_address_alignment = 4K.  Otherwise, we should fortify the
>>> parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then the
>>> userspace will just get an EINVAL, instead of a slient VM corruption.  And
>>> someone should tell Mesa developers to fix the code in this case.
>> I rather see this as a kernel bug. Can you test if this code fragment
>> fixes your issue:
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 64beb3399604..e1260b517e1b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>> *data, struct drm_file *filp)
>>                   }
>>                   dev_info->virtual_address_alignment =
>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>>                   dev_info->pte_fragment_size = (1 <<
>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>> +               dev_info->gart_page_size =
>> dev_info->virtual_address_alignment;
>>                   dev_info->cu_active_number = adev->gfx.cu_info.number;
>>                   dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>>                   dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>
>>
>> Thanks,
>> Christian.
>>
>>> --
>>> Xi Ruoyao <xry111@mengyan1223.wang>
>>> School of Aerospace Science and Technology, Xidian University
>>>
Xi Ruoyao March 30, 2021, 1:09 p.m. UTC | #11
On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> On 2021-03-30 14:55 +0200, Christian König wrote:

> > 

> > I rather see this as a kernel bug. Can you test if this code fragment 

> > fixes your issue:

> > 

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 

> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > index 64beb3399604..e1260b517e1b 100644

> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 

> > *data, struct drm_file *filp)

> >                  }

> >                  dev_info->virtual_address_alignment = 

> > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);

> >                  dev_info->pte_fragment_size = (1 << 

> > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;

> > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;

> > +               dev_info->gart_page_size = 

> > dev_info->virtual_address_alignment;

> >                  dev_info->cu_active_number = adev->gfx.cu_info.number;

> >                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;

> >                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;

> 

> It works.  I've seen it at

> https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191

> before (with a common sub-expression, though :).


Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
without this commit.  But on the system I built from source, I didn't see any
issue before Linux 5.11.  So I misbelieved that it was something already fixed.

Dan: you can try it on your PPC 64 with non-4K page as well.
-- 
Xi Ruoyao <xry111@mengyan1223.wang>
School of Aerospace Science and Technology, Xidian University
Dan Horák March 30, 2021, 1:23 p.m. UTC | #12
On Tue, 30 Mar 2021 21:09:12 +0800
Xi Ruoyao <xry111@mengyan1223.wang> wrote:

> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:

> > On 2021-03-30 14:55 +0200, Christian König wrote:

> > > 

> > > I rather see this as a kernel bug. Can you test if this code fragment 

> > > fixes your issue:

> > > 

> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 

> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > > index 64beb3399604..e1260b517e1b 100644

> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

> > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 

> > > *data, struct drm_file *filp)

> > >                  }

> > >                  dev_info->virtual_address_alignment = 

> > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);

> > >                  dev_info->pte_fragment_size = (1 << 

> > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;

> > > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;

> > > +               dev_info->gart_page_size = 

> > > dev_info->virtual_address_alignment;

> > >                  dev_info->cu_active_number = adev->gfx.cu_info.number;

> > >                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;

> > >                  dev_info->ce_ram_size = adev->gfx.ce_ram_size;

> > 

> > It works.  I've seen it at

> > https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191

> > before (with a common sub-expression, though :).

> 

> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs

> without this commit.  But on the system I built from source, I didn't see any

> issue before Linux 5.11.  So I misbelieved that it was something already fixed.

> 

> Dan: you can try it on your PPC 64 with non-4K page as well.


yup, looks good here as well, ppc64le (Power9) system with 64KB pages


		Dan
Christian König March 30, 2021, 1:24 p.m. UTC | #13
Am 30.03.21 um 15:23 schrieb Dan Horák:
> On Tue, 30 Mar 2021 21:09:12 +0800
> Xi Ruoyao <xry111@mengyan1223.wang> wrote:
>
>> On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
>>> On 2021-03-30 14:55 +0200, Christian König wrote:
>>>> I rather see this as a kernel bug. Can you test if this code fragment
>>>> fixes your issue:
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 64beb3399604..e1260b517e1b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
>>>> *data, struct drm_file *filp)
>>>>                   }
>>>>                   dev_info->virtual_address_alignment =
>>>> max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>>>>                   dev_info->pte_fragment_size = (1 <<
>>>> adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
>>>> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
>>>> +               dev_info->gart_page_size =
>>>> dev_info->virtual_address_alignment;
>>>>                   dev_info->cu_active_number = adev->gfx.cu_info.number;
>>>>                   dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>>>>                   dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>>> It works.  I've seen it at
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&amp;reserved=0
>>> before (with a common sub-expression, though :).
>> Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs
>> without this commit.  But on the system I built from source, I didn't see any
>> issue before Linux 5.11.  So I misbelieved that it was something already fixed.
>>
>> Dan: you can try it on your PPC 64 with non-4K page as well.
> yup, looks good here as well, ppc64le (Power9) system with 64KB pages

Mhm, as far as I can say this patch never made it to us.

Can you please send it to the mailing list with me on CC?

Thanks,
Christian.

>
>
> 		Dan
Xi Ruoyao March 30, 2021, 3:46 p.m. UTC | #14
On 2021-03-30 15:24 +0200, Christian König wrote:
> Am 30.03.21 um 15:23 schrieb Dan Horák:
> > On Tue, 30 Mar 2021 21:09:12 +0800
> > Xi Ruoyao <xry111@mengyan1223.wang> wrote:
> > 
> > > On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
> > > > On 2021-03-30 14:55 +0200, Christian König wrote:
> > > > > I rather see this as a kernel bug. Can you test if this code fragment
> > > > > fixes your issue:
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > index 64beb3399604..e1260b517e1b 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > > > > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void
> > > > > *data, struct drm_file *filp)
> > > > >                   }
> > > > >                   dev_info->virtual_address_alignment =
> > > > > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
> > > > >                   dev_info->pte_fragment_size = (1 <<
> > > > > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> > > > > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> > > > > +               dev_info->gart_page_size =
> > > > > dev_info->virtual_address_alignment;
> > > > >                   dev_info->cu_active_number = adev-
> > > > > >gfx.cu_info.number;
> > > > >                   dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
> > > > >                   dev_info->ce_ram_size = adev->gfx.ce_ram_size;
> > > > It works.  I've seen it at
> > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fxen0n%2Flinux%2Fcommit%2F84ada72983838bd7ce54bc32f5d34ac5b5aae191&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7Cf37fddf20a8847edf67808d8f37ef23c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637527074118791321%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=DZnmee38NGpiWRMX5LmlxOhxAzBMhAusnAWNnCxXTJ0%3D&amp;reserved=0
> > > > before (with a common sub-expression, though :).
> > > Some comment: on an old version of Fedora ported by Loongson, Xorg just
> > > hangs
> > > without this commit.  But on the system I built from source, I didn't see
> > > any
> > > issue before Linux 5.11.  So I misbelieved that it was something already
> > > fixed.
> > > 
> > > Dan: you can try it on your PPC 64 with non-4K page as well.
> > yup, looks good here as well, ppc64le (Power9) system with 64KB pages
> 
> Mhm, as far as I can say this patch never made it to us.

I think maybe its author considers it a "workaround" like me, as there is
already a "virtual_address_alignment" which seems correct.

> Can you please send it to the mailing list with me on CC?

I've sent it, together with my patch for using ~PAGE_MASK in parameter
validation.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index ad91c0c3c423..cee0cc9c8085 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1707,7 +1707,7 @@  static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev,
 		}
 		start = tmp;
 
-	} while (unlikely(start != last + 1));
+	} while (unlikely(start < last + 1));
 
 	r = vm->update_funcs->commit(&params, fence);