
[v3,3/4] fbdev/deferred-io: Support contiguous kernel memory framebuffers

Message ID 20250523161522.409504-4-mhklinux@outlook.com
State New
Series fbdev: Add deferred I/O support for contiguous kernel memory framebuffers

Commit Message

mhkelley58@gmail.com May 23, 2025, 4:15 p.m. UTC
From: Michael Kelley <mhklinux@outlook.com>

Current defio code works only for framebuffer memory that is allocated
with vmalloc(). The code assumes that the underlying page refcount can
be used by the mm subsystem to manage each framebuffer page's lifecycle,
including freeing the page if the refcount goes to 0. This approach is
consistent with vmalloc'ed memory, but not with contiguous kernel memory
allocated via alloc_pages() or similar. Pages in the latter kind of memory
usually have a refcount of 0 when allocated, and would be incorrectly
freed page-by-page if used with defio. That freeing corrupts the memory
free lists, and Linux eventually panics. Simply bumping the refcount after
allocation doesn't work because when the framebuffer memory is freed,
__free_pages() complains about non-zero refcounts.

Commit 37b4837959cb ("video: deferred io with physically contiguous
memory") from the year 2008 purported to add support for contiguous
kernel memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses
dma_alloc_coherent() to allocate framebuffer memory, which is likely to
use alloc_pages(). It's unclear to me how this commit actually worked at
the time, unless dma_alloc_coherent() was pulling from a CMA pool instead
of alloc_pages(), or perhaps alloc_pages() worked differently on the
arm32 architecture on which sh_mobile_lcdcfb is used.

In any case, for x86 and arm64 today, commit 37b4837959cb is not
sufficient to support contiguous kernel memory framebuffers. The problem
can be seen with the hyperv_fb driver, which may allocate the framebuffer
memory using vmalloc() or alloc_pages(), depending on the configuration
of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer.

Fix this limitation by adding defio support for contiguous kernel memory
framebuffers. A driver with a framebuffer allocated from contiguous
kernel memory must set the FBINFO_KMEMFB flag to indicate such.
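As a rough sketch of what this looks like from a driver's side (not taken from the patch set; only FBINFO_KMEMFB, fix.smem_start, and fix.smem_len are meaningful to defio here, while "pages", "fb_size", "my_defio", and "ret" are invented for illustration):

```c
/* Hypothetical probe-time fragment for a contiguous-memory framebuffer. */
info->fix.smem_start = page_to_phys(pages); /* physical address of the fb */
info->fix.smem_len   = fb_size;             /* framebuffer size in bytes */
info->flags         |= FBINFO_KMEMFB;       /* contiguous kernel memory */
info->fbdefio        = &my_defio;           /* .delay and .deferred_io callback */
ret = fb_deferred_io_init(info);
```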

Tested with the hyperv_fb driver in both configurations -- with a vmalloc()
framebuffer and with an alloc_pages() framebuffer on x86. Also verified a
vmalloc() framebuffer on arm64. Hardware is not available to me to verify
that the older arm32 devices still work correctly, but the path for
vmalloc() framebuffers is essentially unchanged.

Even with these changes, defio does not support framebuffers in MMIO
space, as defio code depends on framebuffer memory pages having
corresponding 'struct page's.

Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on HyperV Gen 1 VMs.")
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
Changes in v3:
* Moved definition of FBINFO_KMEMFB flag to a separate patch
  preceding this one in the patch set [Helge Deller]
Changes in v2:
* Tweaked code comments regarding framebuffers allocated with
  dma_alloc_coherent() [Christoph Hellwig]

 drivers/video/fbdev/core/fb_defio.c | 128 +++++++++++++++++++++++-----
 1 file changed, 108 insertions(+), 20 deletions(-)

Comments

kernel test robot May 24, 2025, 7:28 a.m. UTC | #1
Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.15-rc7 next-20250523]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/mhkelley58-gmail-com/mm-Export-vmf_insert_mixed_mkwrite/20250524-001707
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250523161522.409504-4-mhklinux%40outlook.com
patch subject: [PATCH v3 3/4] fbdev/deferred-io: Support contiguous kernel memory framebuffers
config: arm-randconfig-001-20250524 (https://download.01.org/0day-ci/archive/20250524/202505241553.VSXoFOvX-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250524/202505241553.VSXoFOvX-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505241553.VSXoFOvX-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "vmf_insert_pfn" [drivers/video/fbdev/core/fb.ko] undefined!
>> ERROR: modpost: "vmf_insert_mixed_mkwrite" [drivers/video/fbdev/core/fb.ko] undefined!
Christoph Hellwig May 26, 2025, 6:54 a.m. UTC | #2
On Fri, May 23, 2025 at 09:15:21AM -0700, mhkelley58@gmail.com wrote:
> Commit 37b4837959cb ("video: deferred io with physically contiguous
> memory") from the year 2008 purported to add support for contiguous
> kernel memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses
> dma_alloc_coherent() to allocate framebuffer memory, which is likely to
> use alloc_pages(). It's unclear to me how this commit actually worked at
> the time, unless dma_alloc_coherent() was pulling from a CMA pool instead
> of alloc_pages(), or perhaps alloc_pages() worked differently on the
> arm32 architecture on which sh_mobile_lcdcfb is used.
> 
> In any case, for x86 and arm64 today, commit 37b4837959cb9 is not
> sufficient to support contiguous kernel memory framebuffers. The problem
> can be seen with the hyperv_fb driver, which may allocate the framebuffer
> memory using vmalloc() or alloc_pages(), depending on the configuration
> of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer.

That language is far too nice.  The existing users of fb_defio are
all gravely broken because they violate the dma API restriction to
not poke into the memory.  You can't speculate what you get from
dma_alloc_coherent and it can change behind you all the time.

> Fix this limitation by adding defio support for contiguous kernel memory
> framebuffers. A driver with a framebuffer allocated from contiguous
> kernel memory must set the FBINFO_KMEMFB flag to indicate such.

Honestly, the right thing is to invert the flag.  What hyperv is doing
here - taking kernel memory from the direct mapping or from vmalloc - is fine.

What other drivers do is completely broken crap.  So add a flag
FBINFO_BROKEN_CRAP to maybe keep the guessing.  Or just disable it
because it's dangerous.
Thomas Zimmermann June 3, 2025, 6:25 a.m. UTC | #3
Hi

Am 03.06.25 um 03:49 schrieb Michael Kelley:
[...]
>> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
>> horrible hack.
>>
>> In another thread, you mention that you use PFN_SPECIAL to bypass the
>> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> The VMA has VM_PFNMAP set, not VM_MIXEDMAP.  It seemed like
> VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> a wrong impression. vm_mixed_ok() does a thorough job of validating the
> use of __vm_insert_mixed(), and since what I did was allowed, I thought
> perhaps it was OK. Your feedback has set me straight, and that's what I
> needed. :-)
>
> But the whole approach is moot with Alistair Popple's patch set that
> eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> it. If there's not one, I'll take a crack at adding it in the next version of my
> patch set.

What is the motivation behind this work? The driver or fbdev as a whole 
does not have much of a future anyway.

I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm?

Best regards
Thomas

>
> Michael
Michael Kelley June 3, 2025, 5:50 p.m. UTC | #4
From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM
> 
> Hi
> 
> Am 03.06.25 um 03:49 schrieb Michael Kelley:
> [...]
> >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
> >> horrible hack.
> >>
> >> In another thread, you mention that you use PFN_SPECIAL to bypass the
> >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> > The VMA has VM_PFNMAP set, not VM_MIXEDMAP.  It seemed like
> > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> > a wrong impression. vm_mixed_ok() does a thorough job of validating the
> > use of __vm_insert_mixed(), and since what I did was allowed, I thought
> > perhaps it was OK. Your feedback has set me straight, and that's what I
> > needed. :-)
> >
> > But the whole approach is moot with Alistair Popple's patch set that
> > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> > it. If there's not one, I'll take a crack at adding it in the next version of my
> > patch set.
> 
> What is the motivation behind this work? The driver or fbdev as a whole
> does not have much of a future anyway.
> 
> I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm?
> 

Yes, I think that's the longer term direction. A couple months ago I had an
email conversation with Saurabh Sengar from the Microsoft Linux team where
he raised this idea. I think the Microsoft folks will need to drive the deprecation
process, as they need to coordinate with the distro vendors who publish
images for running on local Hyper-V and in the Azure cloud. And my
understanding is that the Linux kernel process would want the driver to
be available but marked "deprecated" for a year or so before it actually
goes away.

I do have some concerns about the maturity of the hyperv_drm driver
"around the edges". For example, somebody just recently submitted a
patch to flush output on panic. I have less familiarity hyperv_drm vs.
hyperv_fb, so some of my concern is probably due to that. We might
need to do review of hyperv_drm and see if there's anything else to
deal with before hyperv_fb goes away.

This all got started when I was looking at a problem with hyperv_fb,
and I found several other related problems, some of which also existed
in hyperv_drm. You've seen several small'ish fixes from me and Saurabh
as a result, and this issue with mmap()'ing /dev/fb0 is the last one of that
set. This fix is definitely a bit bigger, but it's the right fix. On the flip side, 
if we really get on a path to deprecate hyperv_fb, there are hack fixes for
the mmap problem that are smaller and contained to hyperv_fb. I would
be OK with a hack fix in that case.

Michael
Simona Vetter June 4, 2025, 2:45 p.m. UTC | #5
On Wed, Jun 04, 2025 at 10:12:45AM +0200, Thomas Zimmermann wrote:
> Hi
> 
> Am 03.06.25 um 19:50 schrieb Michael Kelley:
> > From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM
> > > Hi
> > > 
> > > Am 03.06.25 um 03:49 schrieb Michael Kelley:
> > > [...]
> > > > > Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
> > > > > horrible hack.
> > > > > 
> > > > > In another thread, you mention that you use PFN_SPECIAL to bypass the
> > > > > check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> > > > The VMA has VM_PFNMAP set, not VM_MIXEDMAP.  It seemed like
> > > > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> > > > a wrong impression. vm_mixed_ok() does a thorough job of validating the
> > > > use of __vm_insert_mixed(), and since what I did was allowed, I thought
> > > > perhaps it was OK. Your feedback has set me straight, and that's what I
> > > > needed. :-)
> > > > 
> > > > But the whole approach is moot with Alistair Popple's patch set that
> > > > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> > > > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> > > > it. If there's not one, I'll take a crack at adding it in the next version of my
> > > > patch set.
> > > What is the motivation behind this work? The driver or fbdev as a whole
> > > does not have much of a future anyway.
> > > 
> > > I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm?
> > > 
> > Yes, I think that's the longer term direction. A couple months ago I had an
> > email conversation with Saurabh Sengar from the Microsoft Linux team where
> > he raised this idea. I think the Microsoft folks will need to drive the deprecation
> > process, as they need to coordinate with the distro vendors who publish
> > images for running on local Hyper-V and in the Azure cloud. And my
> > understanding is that the Linux kernel process would want the driver to
> > be available but marked "deprecated" for a year or so before it actually
> > goes away.
> 
> We (DRM upstream) recently considered moving some fbdev drivers to
> drivers/staging or marking them with !DRM if a DRM driver is available.
> Hyperv_fb would be a candidate.
> 
> At least at SUSE, we ship hypervdrm instead of hyperv_fb. This works well on
> the various generations of the hyperv system. Much of our userspace would
> not be able to use hyperv_fb anyway.

Yeah investing into fbdev drivers, especially when some mm surgery seems
needed, does not sound like a good idea to me overall.

> > I do have some concerns about the maturity of the hyperv_drm driver
> > "around the edges". For example, somebody just recently submitted a
> > patch to flush output on panic. I have less familiarity with hyperv_drm vs.
> > hyperv_fb, so some of my concern is probably due to that. We might
> > need to do review of hyperv_drm and see if there's anything else to
> > deal with before hyperv_fb goes away.
> 
> The panic output is a feature that we recently added to the kernel. It
> allows a DRM driver to display a final error message in the case of a kernel
> panic (think of blue screens on Windows). Drivers require a minimum of
> support to make it work. That's what the hypervdrm patches were about.

I'm also happy to help with any other issues and shortfalls of drm vs
fbdev. There are some, but I thought it was mostly around some of the low
bit color formats that really old devices want, and not anything that
hyperv would need.

Cheers, Sima
> 
> Best regards
> Thomas
> 
> > 
> > This all got started when I was looking at a problem with hyperv_fb,
> > and I found several other related problems, some of which also existed
> > in hyperv_drm. You've seen several small'ish fixes from me and Saurabh
> > as a result, and this issue with mmap()'ing /dev/fb0 is the last one of that
> > set. This fix is definitely a bit bigger, but it's the right fix. On the flip side,
> > if we really get on a path to deprecate hyperv_fb, there are hack fixes for
> > the mmap problem that are smaller and contained to hyperv_fb. I would
> > be OK with a hack fix in that case.
> > 
> > Michael
> 
> -- 
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Frankenstrasse 146, 90461 Nuernberg, Germany
> GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
> HRB 36809 (AG Nuernberg)
>
Michael Kelley June 4, 2025, 9:58 p.m. UTC | #6
From: Michael Kelley <mhklinux@outlook.com> Sent: Tuesday, June 3, 2025 10:25 AM
> 
> From: David Hildenbrand <david@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM
> >
> > On 03.06.25 03:49, Michael Kelley wrote:
> > > From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM
> > >>

[snip]

> > >>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
> > >>>    	}
> > >>>
> > >>>    	/*
> > >>> -	 * We want the page to remain locked from ->page_mkwrite until
> > >>> -	 * the PTE is marked dirty to avoid mapping_wrprotect_range()
> > >>> -	 * being called before the PTE is updated, which would leave
> > >>> -	 * the page ignored by defio.
> > >>> -	 * Do this by locking the page here and informing the caller
> > >>> -	 * about it with VM_FAULT_LOCKED.
> > >>> +	 * The PTE must be marked writable before the defio deferred work runs
> > >>> +	 * again and potentially marks the PTE write-protected. If the order
> > >>> +	 * should be switched, the PTE would become writable without defio
> > >>> +	 * tracking the page, leaving the page forever ignored by defio.
> > >>> +	 *
> > >>> +	 * For vmalloc() framebuffers, the associated struct page is locked
> > >>> +	 * before releasing the defio lock. mm will later mark the PTE writable
> > >>> +	 * and release the struct page lock. The struct page lock prevents
> > >>> +	 * the page from being prematurely marked write-protected.
> > >>> +	 *
> > >>> +	 * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
> > >>> +	 * so the PTE must be marked writable while the defio lock is held.
> > >>>    	 */
> > >>> -	lock_page(pageref->page);
> > >>> +	if (info->flags & FBINFO_KMEMFB) {
> > >>> +		unsigned long pfn = page_to_pfn(pageref->page);
> > >>> +
> > >>> +		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
> > >>> +					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));
> > >>
> > >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
> > >> horrible hack.
> > >>
> > >> In another thread, you mention that you use PFN_SPECIAL to bypass the
> > >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> > >
> > > The VMA has VM_PFNMAP set, not VM_MIXEDMAP.  It seemed like
> > > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> > > a wrong impression.
> >
> > VM_PFNMAP: nothing is refcounted except anon pages
> >
> > VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted
> >
> > pte_special() is a way for GUP-fast to distinguish these refcounted (can
> > GUP) from non-refcounted (cannot GUP) pages mapped by PTEs without any
> > locks or the VMA being available.
> >
> > Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page"
> > (pfn_valid()) is likely very bogus.
> 
> OK, good to know.
> 
> >
> > > vm_mixed_ok() does a thorough job of validating the
> > > use of __vm_insert_mixed(), and since what I did was allowed, I thought
> > > perhaps it was OK. Your feedback has set me straight, and that's what I
> > > needed. :-)
> >
> > What exactly are you trying to achieve? :)
> >
> > If it's mapping a page with a "struct page" and *not* refcounting it,
> > then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP
> > mapping. It will set pte_special() automatically for you.
> >
> 
> Yes, that's what I'm using to initially create the special PTE in the
> .fault callback.
> 
> > >
> > > But the whole approach is moot with Alistair Popple's patch set that
> > > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> > > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> > > it. If there's not one, I'll take a crack at adding it in the next version of my
> > > patch set.
> >
> > I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably
> > vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot. (maybe
> > :) )
> 
> Ok, I'll look at that more closely. The sequence is that the special
> PTE gets created with vmf_insert_pfn(). Then when the page is first
> written to, the .pfn_mkwrite callback is invoked by mm. The question
> is the best way for that callback to mark the existing PTE as writable.
> 

FWIW, vmf_insert_pfn_prot() won't work. It calls insert_pfn() with
the "mkwrite" parameter set to 'false', in which case insert_pfn()
does nothing if the PTE already exists.

So I would need to create a new API that does appropriate validation
for a VM_PFNMAP VMA, and then calls insert_pfn() with the "mkwrite"
parameter set to 'true'.

Michael
David Hildenbrand June 5, 2025, 8:10 a.m. UTC | #7
On 04.06.25 23:58, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com> Sent: Tuesday, June 3, 2025 10:25 AM
>>
>> From: David Hildenbrand <david@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM
>>>
>>> On 03.06.25 03:49, Michael Kelley wrote:
>>>> From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM
>>>>>
> 
> [snip]
> 
>>>>>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
>>>>>>     	}
>>>>>>
>>>>>>     	/*
>>>>>> -	 * We want the page to remain locked from ->page_mkwrite until
>>>>>> -	 * the PTE is marked dirty to avoid mapping_wrprotect_range()
>>>>>> -	 * being called before the PTE is updated, which would leave
>>>>>> -	 * the page ignored by defio.
>>>>>> -	 * Do this by locking the page here and informing the caller
>>>>>> -	 * about it with VM_FAULT_LOCKED.
>>>>>> +	 * The PTE must be marked writable before the defio deferred work runs
>>>>>> +	 * again and potentially marks the PTE write-protected. If the order
>>>>>> +	 * should be switched, the PTE would become writable without defio
>>>>>> +	 * tracking the page, leaving the page forever ignored by defio.
>>>>>> +	 *
>>>>>> +	 * For vmalloc() framebuffers, the associated struct page is locked
>>>>>> +	 * before releasing the defio lock. mm will later mark the PTE writable
>>>>>> +	 * and release the struct page lock. The struct page lock prevents
>>>>>> +	 * the page from being prematurely marked write-protected.
>>>>>> +	 *
>>>>>> +	 * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
>>>>>> +	 * so the PTE must be marked writable while the defio lock is held.
>>>>>>     	 */
>>>>>> -	lock_page(pageref->page);
>>>>>> +	if (info->flags & FBINFO_KMEMFB) {
>>>>>> +		unsigned long pfn = page_to_pfn(pageref->page);
>>>>>> +
>>>>>> +		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
>>>>>> +					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));
>>>>>
>>>>> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
>>>>> horrible hack.
>>>>>
>>>>> In another thread, you mention that you use PFN_SPECIAL to bypass the
>>>>> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
>>>>
>>>> The VMA has VM_PFNMAP set, not VM_MIXEDMAP.  It seemed like
>>>> VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
>>>> a wrong impression.
>>>
>>> VM_PFNMAP: nothing is refcounted except anon pages
>>>
>>> VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted
>>>
>>> pte_special() is a way for GUP-fast to distinguish these refcounted (can
>>> GUP) from non-refcounted (cannot GUP) pages mapped by PTEs without any
>>> locks or the VMA being available.
>>>
>>> Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page"
>>> (pfn_valid()) is likely very bogus.
>>
>> OK, good to know.
>>
>>>
>>>> vm_mixed_ok() does a thorough job of validating the
>>>> use of __vm_insert_mixed(), and since what I did was allowed, I thought
>>>> perhaps it was OK. Your feedback has set me straight, and that's what I
>>>> needed. :-)
>>>
>>> What exactly are you trying to achieve? :)
>>>
>>> If it's mapping a page with a "struct page" and *not* refcounting it,
>>> then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP
>>> mapping. It will set pte_special() automatically for you.
>>>
>>
>> Yes, that's what I'm using to initially create the special PTE in the
>> .fault callback.
>>
>>>>
>>>> But the whole approach is moot with Alistair Popple's patch set that
>>>> eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
>>>> special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
>>>> it. If there's not one, I'll take a crack at adding it in the next version of my
>>>> patch set.
>>>
>>> I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably
>>> vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot. (maybe
>>> :) )
>>
>> Ok, I'll look at that more closely. The sequence is that the special
>> PTE gets created with vmf_insert_pfn(). Then when the page is first
>> written to, the .pfn_mkwrite callback is invoked by mm. The question
>> is the best way for that callback to mark the existing PTE as writable.
>>
> 
> FWIW, vmf_insert_pfn_prot() won't work. It calls insert_pfn() with
> the "mkwrite" parameter set to 'false', in which case insert_pfn()
> does nothing if the PTE already exists.

Ah, you are worried about the "already exists but is R/O case".

> 
> So I would need to create a new API that does appropriate validation
> for a VM_PFNMAP VMA, and then calls insert_pfn() with the "mkwrite"
> parameter set to 'true'.

IMHO, nothing would speak against vmf_insert_pfn_mkwrite().

Much better than using that "mixed" ... beauty of a function.
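For readers following along, the fault flow under discussion might look roughly like this. A sketch only: my_fault, my_pfn_mkwrite, and my_fb_pfn are invented names, and vmf_insert_pfn_mkwrite() is the helper being proposed in this thread, not an existing API at the time of writing.

```c
/* Hypothetical vm_operations for a VM_PFNMAP framebuffer mapping. */
static vm_fault_t my_fault(struct vm_fault *vmf)
{
	unsigned long pfn = my_fb_pfn(vmf);	/* hypothetical pfn lookup */

	/* Insert a read-only "special" PTE; no struct page refcounting. */
	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static vm_fault_t my_pfn_mkwrite(struct vm_fault *vmf)
{
	/*
	 * First write to the page: mark the existing special PTE writable
	 * so defio can track the page until the next deferred-I/O pass.
	 */
	return vmf_insert_pfn_mkwrite(vmf->vma, vmf->address, my_fb_pfn(vmf));
}

static const struct vm_operations_struct my_vm_ops = {
	.fault       = my_fault,
	.pfn_mkwrite = my_pfn_mkwrite,
};
```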
Michael Kelley June 5, 2025, 5:38 p.m. UTC | #8
From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Thursday, June 5, 2025 8:36 AM
> 
> Hi
> 
> Am 04.06.25 um 23:43 schrieb Michael Kelley:
> [...]
> > Nonetheless, there's an underlying issue. A main cause of the difference
> > is the number of messages to Hyper-V to update dirty regions. With
> > hyperv_fb using deferred I/O, the messages are limited 20/second, so
> > the total number of messages to Hyper-V is about 480. But hyperv_drm
> > appears to send 3 messages to Hyper-V for each line of output, or a total of
> > about 3,000,000 messages (~90K/second). That's a lot of additional load
> > on the Hyper-V host, and it adds the 10 seconds of additional elapsed
> > time seen in the guest. There also this ugly output in dmesg because the
> > ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V
> > doesn't always keep up, at least not on my local laptop where I'm
> > testing:
> >
> > [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed
> > [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> > [12597.016267] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm]
> *ERROR* Unable to send packet via vmbus; error -11
> >
> > hyperv_drm could be fixed to not output the ugly messages, but there's
> > still the underlying issue of overrunning the ring buffer, and excessively
> > hammering on the host. If we could get hyperv_drm doing deferred I/O, I
> > would feel much better about going full-on with deprecating hyperv_fb.
> 
> I try to address the problem with the patches at
> 
> https://lore.kernel.org/dri-devel/20250605152637.98493-1-tzimmermann@suse.de/
> 
> Testing and feedback is much appreciated.
> 

Nice!

I ran the same test case with your patches, and everything works well. The
hyperv_drm numbers are now pretty much the same as the hyperv_fb
numbers for both elapsed time and system CPU time -- within a few percent.
For hyperv_drm, there's no longer a gap in the elapsed time and system
CPU time. No errors due to the guest-to-host ring buffer being full. Total
messages to Hyper-V for hyperv_drm are now a few hundred instead of 3M.
The hyperv_drm message count is still a little higher than for hyperv_fb,
presumably because the simulated vblank rate in hyperv_drm is higher than
the 20 Hz rate used by hyperv_fb deferred I/O. But the overall numbers are
small enough that the difference is in the noise. Question: what is the default
value for the simulated vblank rate? Just curious ...

Michael

Patch

diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index 4fc93f253e06..f8ae91a1c4df 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -8,11 +8,40 @@ 
  * for more details.
  */
 
+/*
+ * Deferred I/O ("defio") allows framebuffers that are mmap()'ed to user space
+ * to batch user space writes into periodic updates to the underlying
+ * framebuffer hardware or other implementation (such as with a virtualized
+ * framebuffer in a VM). At each batch interval, a callback is invoked in the
+ * framebuffer's kernel driver, and the callback is supplied with a list of
+ * pages that have been modified in the preceding interval. The callback can
+ * use this information to update the framebuffer hardware as necessary. The
+ * batching can improve performance and reduce the overhead of updating the
+ * hardware.
+ *
+ * Defio is supported on framebuffers allocated using vmalloc() and allocated
+ * as contiguous kernel memory using alloc_pages() or kmalloc(). These
+ * memory allocations all have corresponding "struct page"s. Framebuffers
+ * allocated using dma_alloc_coherent() should not be used with defio.
+ * Such allocations should be treated as a black box owned by the DMA
+ * layer, and should not be deconstructed into individual pages as defio
+ * does. Framebuffers in MMIO space are *not* supported because MMIO space
+ * does not have corresponding "struct page"s.
+ *
+ * For framebuffers allocated using vmalloc(), struct fb_info must have
+ * "screen_buffer" set to the vmalloc address of the framebuffer. For
+ * framebuffers allocated from contiguous kernel memory, FBINFO_KMEMFB must
+ * be set, and "fix.smem_start" must be set to the physical address of the
+ * framebuffer. In both cases, "fix.smem_len" must be set to the framebuffer
+ * size in bytes.
+ */
+
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/string.h>
 #include <linux/mm.h>
+#include <linux/pfn_t.h>
 #include <linux/vmalloc.h>
 #include <linux/delay.h>
 #include <linux/interrupt.h>
@@ -37,7 +66,7 @@  static struct page *fb_deferred_io_get_page(struct fb_info *info, unsigned long
 	else if (info->fix.smem_start)
 		page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);
 
-	if (page)
+	if (page && !(info->flags & FBINFO_KMEMFB))
 		get_page(page);
 
 	return page;
@@ -137,6 +166,15 @@  static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
 
 	BUG_ON(!info->fbdefio->mapping);
 
+	if (info->flags & FBINFO_KMEMFB)
+		/*
+		 * In this path, the VMA is marked VM_PFNMAP, so mm assumes
+		 * there is no struct page associated with the page. The
+		 * PFN must be directly inserted and the created PTE will be
+		 * marked "special".
+		 */
+		return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));
+
 	vmf->page = page;
 	return 0;
 }
@@ -163,13 +201,14 @@  EXPORT_SYMBOL_GPL(fb_deferred_io_fsync);
 
 /*
  * Adds a page to the dirty list. Call this from struct
- * vm_operations_struct.page_mkwrite.
+ * vm_operations_struct.page_mkwrite or .pfn_mkwrite.
  */
-static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long offset,
+static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, struct vm_fault *vmf,
 					    struct page *page)
 {
 	struct fb_deferred_io *fbdefio = info->fbdefio;
 	struct fb_deferred_io_pageref *pageref;
+	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
 	vm_fault_t ret;
 
 	/* protect against the workqueue changing the page list */
@@ -182,20 +221,34 @@  static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
 	}
 
 	/*
-	 * We want the page to remain locked from ->page_mkwrite until
-	 * the PTE is marked dirty to avoid mapping_wrprotect_range()
-	 * being called before the PTE is updated, which would leave
-	 * the page ignored by defio.
-	 * Do this by locking the page here and informing the caller
-	 * about it with VM_FAULT_LOCKED.
+	 * The PTE must be marked writable before the defio deferred work runs
+	 * again and potentially marks the PTE write-protected. If the order
+	 * were reversed, the PTE would become writable without defio
+	 * tracking the page, leaving the page forever ignored by defio.
+	 *
+	 * For vmalloc() framebuffers, the associated struct page is locked
+	 * before releasing the defio lock. mm will later mark the PTE writable
+	 * and release the struct page lock. The struct page lock prevents
+	 * the page from being prematurely marked write-protected.
+	 *
+	 * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
+	 * so the PTE must be marked writable while the defio lock is held.
 	 */
-	lock_page(pageref->page);
+	if (info->flags & FBINFO_KMEMFB) {
+		unsigned long pfn = page_to_pfn(pageref->page);
+
+		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
+					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));
+	} else {
+		lock_page(pageref->page);
+		ret = VM_FAULT_LOCKED;
+	}
 
 	mutex_unlock(&fbdefio->lock);
 
 	/* come back after delay to process the deferred IO */
 	schedule_delayed_work(&info->deferred_work, fbdefio->delay);
-	return VM_FAULT_LOCKED;
+	return ret;
 
 err_mutex_unlock:
 	mutex_unlock(&fbdefio->lock);
@@ -207,10 +260,10 @@  static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
  * @fb_info: The fbdev info structure
  * @vmf: The VM fault
  *
- * This is a callback we get when userspace first tries to
- * write to the page. We schedule a workqueue. That workqueue
- * will eventually mkclean the touched pages and execute the
- * deferred framebuffer IO. Then if userspace touches a page
+ * This is a callback we get when userspace first tries to write to a
+ * page. We schedule a workqueue. That workqueue will eventually do
+ * mapping_wrprotect_range() on the written pages and execute the
+ * deferred framebuffer IO. Then if userspace writes to a page
  * again, we repeat the same scheme.
  *
  * Returns:
@@ -218,12 +271,11 @@  static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
  */
 static vm_fault_t fb_deferred_io_page_mkwrite(struct fb_info *info, struct vm_fault *vmf)
 {
-	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
 	struct page *page = vmf->page;
 
 	file_update_time(vmf->vma->vm_file);
 
-	return fb_deferred_io_track_page(info, offset, page);
+	return fb_deferred_io_track_page(info, vmf, page);
 }
 
 /* vm_ops->page_mkwrite handler */
@@ -234,9 +286,25 @@  static vm_fault_t fb_deferred_io_mkwrite(struct vm_fault *vmf)
 	return fb_deferred_io_page_mkwrite(info, vmf);
 }
 
+/*
+ * Similar to fb_deferred_io_mkwrite(), but for first writes to pages
+ * in VMAs that have VM_PFNMAP set.
+ */
+static vm_fault_t fb_deferred_io_pfn_mkwrite(struct vm_fault *vmf)
+{
+	struct fb_info *info = vmf->vma->vm_private_data;
+	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
+	struct page *page = phys_to_page(info->fix.smem_start + offset);
+
+	file_update_time(vmf->vma->vm_file);
+
+	return fb_deferred_io_track_page(info, vmf, page);
+}
+
 static const struct vm_operations_struct fb_deferred_io_vm_ops = {
 	.fault		= fb_deferred_io_fault,
 	.page_mkwrite	= fb_deferred_io_mkwrite,
+	.pfn_mkwrite	= fb_deferred_io_pfn_mkwrite,
 };
 
 static const struct address_space_operations fb_deferred_io_aops = {
@@ -246,11 +314,31 @@  static const struct address_space_operations fb_deferred_io_aops = {
 int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma)
 {
+	vm_flags_t flags = VM_DONTEXPAND | VM_DONTDUMP;
+
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
 	vma->vm_ops = &fb_deferred_io_vm_ops;
-	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
-	if (!(info->flags & FBINFO_VIRTFB))
-		vm_flags_set(vma, VM_IO);
+	if (info->flags & FBINFO_KMEMFB) {
+		/*
+		 * The I/O fault path calls vmf_insert_pfn(), which BUG()s
+		 * if the VMA is not marked shared. mmap'ing the framebuffer
+		 * as PRIVATE makes little sense anyway, though it is
+		 * harmless for vmalloc() framebuffers, so private mappings
+		 * remain allowed in that case.
+		 */
+		if (!(vma->vm_flags & VM_SHARED))
+			return -EINVAL;
+		/*
+		 * Set VM_PFNMAP so mm code will not try to manage the pages'
+		 * lifecycles. We don't want individual pages to be freed
+		 * based on refcount. Instead the memory must be returned to
+		 * the free pool in the usual way. Cf. the implementation of
+		 * remap_pfn_range() and remap_pfn_range_internal().
+		 */
+		flags |= VM_PFNMAP | VM_IO;
+	} else if (!(info->flags & FBINFO_VIRTFB)) {
+		flags |= VM_IO;
+	}
+	vm_flags_set(vma, flags);
 	vma->vm_private_data = info;
 	return 0;
 }