Message ID | 20240419034034.2842-1-quic_bqiang@quicinc.com |
---|---|
State | New |
Headers | show |
Series | wifi: ath12k: fix kernel crash during resume | expand |
Baochen Qiang <quic_bqiang@quicinc.com> writes: > Currently during resume, QMI target memory is not properly handled, resulting > in kernel crash in case DMA remap is not supported: > > BUG: Bad page state in process kworker/u16:54 pfn:36e80 > page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x36e80 > page dumped because: nonzero _refcount > Call Trace: > bad_page > free_page_is_bad_report > __free_pages_ok > __free_pages > dma_direct_free > dma_free_attrs > ath12k_qmi_free_target_mem_chunk > ath12k_qmi_msg_mem_request_cb > > The reason is: > Once ath12k module is loaded, firmware sends memory request to host. In case > DMA remap not supported, ath12k refuses the first request due to failure in > allocating with large segment size: > > ath12k_pci 0000:04:00.0: qmi firmware request memory request > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 7077888 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 8454144 > ath12k_pci 0000:04:00.0: qmi dma allocation failed (7077888 B type 1), will try later with small size > ath12k_pci 0000:04:00.0: qmi delays mem_request 2 > ath12k_pci 0000:04:00.0: qmi firmware request memory request > > Later firmware comes back with more but small segments and allocation > succeeds: > > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 262144 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 65536 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > > Now ath12k is working. If suspend is triggered, firmware will be reloaded > during resume. As same as before, firmware requests two large segments at > first. In ath12k_qmi_msg_mem_request_cb() segment count and size are > assigned: > > ab->qmi.mem_seg_count == 2 > ab->qmi.target_mem[0].size == 7077888 > ab->qmi.target_mem[1].size == 8454144 > > Then allocation failed like before and ath12k_qmi_free_target_mem_chunk() > is called to free all allocated segments. Note the first segment is skipped > because its v.addr is cleared due to allocation failure: > > chunk->v.addr = dma_alloc_coherent() > > Also note that this leaks that segment because it has not been freed. > > While freeing the second segment, a size of 8454144 is passed to > dma_free_coherent(). However remember that this segment is allocated at > the first time firmware is loaded, before suspend. So its real size is > 524288, much smaller than 8454144. As a result kernel found we are freeing > some memory which is in use and thus crashed. > > So one possible fix would be to free those segments during suspend. This > works because with them freed, ath12k_qmi_free_target_mem_chunk() does > nothing: all segment addresses are NULL so dma_free_coherent() is not called. > > But note that ath11k has similar logic but never hits this issue. Reviewing > code there shows the luck comes from QMI memory reuse logic. So the decision > is to port it to ath12k. Like in ath11k, the crash is avoided by adding > prev_size to target_mem_chunk structure and caching real segment size in it, > then prev_size instead of current size is passed to dma_free_coherent(), > no unexpected memory is freed now. > > Also reuse m3 buffer. > > Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0-03427-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.15378.4 > Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3 > > Fixes: 64430ddfb132 ("wifi: ath12k: support suspend/resume") > Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com> To keep git bisect clean should this patch be applied before applying the suspend patchset? No need resend or anything, just checking what should be the correct order.
Kalle Valo <kvalo@kernel.org> writes: > Kalle Valo <kvalo@kernel.org> writes: > >> Baochen Qiang <quic_bqiang@quicinc.com> writes: >> >>> Currently during resume, QMI target memory is not properly handled, resulting >>> in kernel crash in case DMA remap is not supported: >>> >>> BUG: Bad page state in process kworker/u16:54 pfn:36e80 >>> page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x36e80 >>> page dumped because: nonzero _refcount >>> Call Trace: >>> bad_page >>> free_page_is_bad_report >>> __free_pages_ok >>> __free_pages >>> dma_direct_free >>> dma_free_attrs >>> ath12k_qmi_free_target_mem_chunk >>> ath12k_qmi_msg_mem_request_cb >>> >>> The reason is: >>> Once ath12k module is loaded, firmware sends memory request to host. In case >>> DMA remap not supported, ath12k refuses the first request due to failure in >>> allocating with large segment size: >>> >>> ath12k_pci 0000:04:00.0: qmi firmware request memory request >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 7077888 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 8454144 >>> ath12k_pci 0000:04:00.0: qmi dma allocation failed (7077888 B type 1), will try later with small size >>> ath12k_pci 0000:04:00.0: qmi delays mem_request 2 >>> ath12k_pci 0000:04:00.0: qmi firmware request memory request >>> >>> Later firmware comes back with more but small segments and allocation >>> succeeds: >>> >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 262144 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 65536 >>> ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 >>> >>> Now ath12k is working. If suspend is triggered, firmware will be reloaded >>> during resume. As same as before, firmware requests two large segments at >>> first. In ath12k_qmi_msg_mem_request_cb() segment count and size are >>> assigned: >>> >>> ab->qmi.mem_seg_count == 2 >>> ab->qmi.target_mem[0].size == 7077888 >>> ab->qmi.target_mem[1].size == 8454144 >>> >>> Then allocation failed like before and ath12k_qmi_free_target_mem_chunk() >>> is called to free all allocated segments. Note the first segment is skipped >>> because its v.addr is cleared due to allocation failure: >>> >>> chunk->v.addr = dma_alloc_coherent() >>> >>> Also note that this leaks that segment because it has not been freed. >>> >>> While freeing the second segment, a size of 8454144 is passed to >>> dma_free_coherent(). However remember that this segment is allocated at >>> the first time firmware is loaded, before suspend. So its real size is >>> 524288, much smaller than 8454144. As a result kernel found we are freeing >>> some memory which is in use and thus crashed. >>> >>> So one possible fix would be to free those segments during suspend. This >>> works because with them freed, ath12k_qmi_free_target_mem_chunk() does >>> nothing: all segment addresses are NULL so dma_free_coherent() is not called. >>> >>> But note that ath11k has similar logic but never hits this issue. Reviewing >>> code there shows the luck comes from QMI memory reuse logic. So the decision >>> is to port it to ath12k. Like in ath11k, the crash is avoided by adding >>> prev_size to target_mem_chunk structure and caching real segment size in it, >>> then prev_size instead of current size is passed to dma_free_coherent(), >>> no unexpected memory is freed now. >>> >>> Also reuse m3 buffer. >>> >>> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0-03427-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.15378.4 >>> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3 >>> >>> Fixes: 64430ddfb132 ("wifi: ath12k: support suspend/resume") > > ERROR: Commit id not found from current branch: Fixes: 64430ddfb132 > ("wifi: ath12k: support suspend/resume") > > Do note that commit ids in the pending branch are not stable so fixes > tags cannot be used for those commits. > >>> Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com> >> >> To keep git bisect clean should this patch be applied before applying >> the suspend patchset? No need resend or anything, just checking what >> should be the correct order. > > Ah, I just checked and it looks like this patch depends on the suspend > patchset. I now applied this patch to pendign branch > (ath-pending-202404190433). To avoid breaking git bisect I now reshuffled the pending branch so that 'wifi: ath12k: fix kernel crash during resume' is before the suspend patchset and my plan is to commit the patches in same order. Does this make sense for everyone? I had a simple conflict in ath12k_core_reset() in these commits: https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=052a6a7c74ae4f20a31a632184f4439edcb0f40c https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=6bc740619993535cb64c4dcebb83c66acd506763 Please check my changes.
On 4/22/2024 6:12 AM, Kalle Valo wrote: > To avoid breaking git bisect I now reshuffled the pending branch so that > 'wifi: ath12k: fix kernel crash during resume' is before the suspend > patchset and my plan is to commit the patches in same order. Does this > make sense for everyone? > > I had a simple conflict in ath12k_core_reset() in these commits: > > https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=052a6a7c74ae4f20a31a632184f4439edcb0f40c > > https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=6bc740619993535cb64c4dcebb83c66acd506763 > > Please check my changes. > These look OK to me. I'll test the pending branch on my laptop
On 4/22/2024 8:41 AM, Jeff Johnson wrote: > On 4/22/2024 6:12 AM, Kalle Valo wrote: >> To avoid breaking git bisect I now reshuffled the pending branch so that >> 'wifi: ath12k: fix kernel crash during resume' is before the suspend >> patchset and my plan is to commit the patches in same order. Does this >> make sense for everyone? >> >> I had a simple conflict in ath12k_core_reset() in these commits: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=052a6a7c74ae4f20a31a632184f4439edcb0f40c >> >> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=6bc740619993535cb64c4dcebb83c66acd506763 >> >> Please check my changes. >> > These look OK to me. > I'll test the pending branch on my laptop pending branch crashes on my laptop let me bisect
On 4/22/2024 11:41 AM, Jeff Johnson wrote: > On 4/22/2024 8:41 AM, Jeff Johnson wrote: >> On 4/22/2024 6:12 AM, Kalle Valo wrote: >>> To avoid breaking git bisect I now reshuffled the pending branch so that >>> 'wifi: ath12k: fix kernel crash during resume' is before the suspend >>> patchset and my plan is to commit the patches in same order. Does this >>> make sense for everyone? >>> >>> I had a simple conflict in ath12k_core_reset() in these commits: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=052a6a7c74ae4f20a31a632184f4439edcb0f40c >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=6bc740619993535cb64c4dcebb83c66acd506763 >>> >>> Please check my changes. >>> >> These look OK to me. >> I'll test the pending branch on my laptop > > pending branch crashes on my laptop > let me bisect ok, that was user error. I had updated my workspace on my VM but not on my actual laptop, and the old pending branch on my laptop contained the follow patch which I've already flagged as bad (and is not present in the current pending branch): wifi: ath12k: Fix Tx Completion Ring(WBM2SW) Setup Failure I updated my laptop to: 5085843538af (origin/pending, pending) wifi: ath12k: report station mode signal strength And everything in that pending branch works fine, including hotspot. /jeff
Jeff Johnson <quic_jjohnson@quicinc.com> writes: > On 4/22/2024 11:41 AM, Jeff Johnson wrote: >> On 4/22/2024 8:41 AM, Jeff Johnson wrote: >>> On 4/22/2024 6:12 AM, Kalle Valo wrote: >>>> To avoid breaking git bisect I now reshuffled the pending branch so that >>>> 'wifi: ath12k: fix kernel crash during resume' is before the suspend >>>> patchset and my plan is to commit the patches in same order. Does this >>>> make sense for everyone? >>>> >>>> I had a simple conflict in ath12k_core_reset() in these commits: >>>> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=052a6a7c74ae4f20a31a632184f4439edcb0f40c >>>> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=6bc740619993535cb64c4dcebb83c66acd506763 >>>> >>>> Please check my changes. >>>> >>> These look OK to me. >>> I'll test the pending branch on my laptop >> >> pending branch crashes on my laptop >> let me bisect > > ok, that was user error. > > I had updated my workspace on my VM but not on my actual laptop, and the old > pending branch on my laptop contained the follow patch which I've already > flagged as bad (and is not present in the current pending branch): > > wifi: ath12k: Fix Tx Completion Ring(WBM2SW) Setup Failure > > I updated my laptop to: > 5085843538af (origin/pending, pending) wifi: ath12k: report station mode > signal strength > > And everything in that pending branch works fine, including hotspot. Thanks for testing. Unless no objections from Baochen I'll then apply this fix and the suspend patches.
On 4/23/2024 12:10 PM, Kalle Valo wrote: > Jeff Johnson <quic_jjohnson@quicinc.com> writes: > >> On 4/22/2024 11:41 AM, Jeff Johnson wrote: >>> On 4/22/2024 8:41 AM, Jeff Johnson wrote: >>>> On 4/22/2024 6:12 AM, Kalle Valo wrote: >>>>> To avoid breaking git bisect I now reshuffled the pending branch so that >>>>> 'wifi: ath12k: fix kernel crash during resume' is before the suspend >>>>> patchset and my plan is to commit the patches in same order. Does this >>>>> make sense for everyone? >>>>> >>>>> I had a simple conflict in ath12k_core_reset() in these commits: >>>>> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=052a6a7c74ae4f20a31a632184f4439edcb0f40c >>>>> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=pending&id=6bc740619993535cb64c4dcebb83c66acd506763 >>>>> >>>>> Please check my changes. Looks good to me. >>>>> >>>> These look OK to me. >>>> I'll test the pending branch on my laptop >>> >>> pending branch crashes on my laptop >>> let me bisect >> >> ok, that was user error. >> >> I had updated my workspace on my VM but not on my actual laptop, and the old >> pending branch on my laptop contained the follow patch which I've already >> flagged as bad (and is not present in the current pending branch): >> >> wifi: ath12k: Fix Tx Completion Ring(WBM2SW) Setup Failure >> >> I updated my laptop to: >> 5085843538af (origin/pending, pending) wifi: ath12k: report station mode >> signal strength >> >> And everything in that pending branch works fine, including hotspot. > > Thanks for testing. Unless no objections from Baochen I'll then apply > this fix and the suspend patches. No objection from me, kalle. >
Baochen Qiang <quic_bqiang@quicinc.com> wrote: > Currently during resume, QMI target memory is not properly handled, resulting > in kernel crash in case DMA remap is not supported: > > BUG: Bad page state in process kworker/u16:54 pfn:36e80 > page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x36e80 > page dumped because: nonzero _refcount > Call Trace: > bad_page > free_page_is_bad_report > __free_pages_ok > __free_pages > dma_direct_free > dma_free_attrs > ath12k_qmi_free_target_mem_chunk > ath12k_qmi_msg_mem_request_cb > > The reason is: > Once ath12k module is loaded, firmware sends memory request to host. In case > DMA remap not supported, ath12k refuses the first request due to failure in > allocating with large segment size: > > ath12k_pci 0000:04:00.0: qmi firmware request memory request > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 7077888 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 8454144 > ath12k_pci 0000:04:00.0: qmi dma allocation failed (7077888 B type 1), will try later with small size > ath12k_pci 0000:04:00.0: qmi delays mem_request 2 > ath12k_pci 0000:04:00.0: qmi firmware request memory request > > Later firmware comes back with more but small segments and allocation > succeeds: > > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 262144 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 > ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 65536 > ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 > > Now ath12k is working. If suspend is triggered, firmware will be reloaded > during resume. As same as before, firmware requests two large segments at > first. In ath12k_qmi_msg_mem_request_cb() segment count and size are > assigned: > > ab->qmi.mem_seg_count == 2 > ab->qmi.target_mem[0].size == 7077888 > ab->qmi.target_mem[1].size == 8454144 > > Then allocation failed like before and ath12k_qmi_free_target_mem_chunk() > is called to free all allocated segments. Note the first segment is skipped > because its v.addr is cleared due to allocation failure: > > chunk->v.addr = dma_alloc_coherent() > > Also note that this leaks that segment because it has not been freed. > > While freeing the second segment, a size of 8454144 is passed to > dma_free_coherent(). However remember that this segment is allocated at > the first time firmware is loaded, before suspend. So its real size is > 524288, much smaller than 8454144. As a result kernel found we are freeing > some memory which is in use and thus crashed. > > So one possible fix would be to free those segments during suspend. This > works because with them freed, ath12k_qmi_free_target_mem_chunk() does > nothing: all segment addresses are NULL so dma_free_coherent() is not called. > > But note that ath11k has similar logic but never hits this issue. Reviewing > code there shows the luck comes from QMI memory reuse logic. So the decision > is to port it to ath12k. Like in ath11k, the crash is avoided by adding > prev_size to target_mem_chunk structure and caching real segment size in it, > then prev_size instead of current size is passed to dma_free_coherent(), > no unexpected memory is freed now. > > Also reuse m3 buffer. > > Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0-03427-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.15378.4 > Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3 > > Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com> > Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Patch applied to ath-next branch of ath.git, thanks. 303c017821d8 wifi: ath12k: fix kernel crash during resume
diff --git a/drivers/net/wireless/ath/ath12k/core.c b/drivers/net/wireless/ath/ath12k/core.c index aefe667c3dcb..420e5fb002ad 100644 --- a/drivers/net/wireless/ath/ath12k/core.c +++ b/drivers/net/wireless/ath/ath12k/core.c @@ -1158,7 +1158,6 @@ static void ath12k_core_reset(struct work_struct *work) ath12k_hif_ce_irq_disable(ab); ath12k_hif_power_down(ab, false); - ath12k_qmi_free_resource(ab); ath12k_hif_power_up(ab); ath12k_dbg(ab, ATH12K_DBG_BOOT, "reset started\n"); diff --git a/drivers/net/wireless/ath/ath12k/qmi.c b/drivers/net/wireless/ath/ath12k/qmi.c index 2f3c6f2e342a..ebb20fc8e3ba 100644 --- a/drivers/net/wireless/ath/ath12k/qmi.c +++ b/drivers/net/wireless/ath/ath12k/qmi.c @@ -2357,8 +2357,9 @@ static void ath12k_qmi_free_target_mem_chunk(struct ath12k_base *ab) for (i = 0; i < ab->qmi.mem_seg_count; i++) { if (!ab->qmi.target_mem[i].v.addr) continue; + dma_free_coherent(ab->dev, - ab->qmi.target_mem[i].size, + ab->qmi.target_mem[i].prev_size, ab->qmi.target_mem[i].v.addr, ab->qmi.target_mem[i].paddr); ab->qmi.target_mem[i].v.addr = NULL; @@ -2384,6 +2385,20 @@ static int ath12k_qmi_alloc_target_mem_chunk(struct ath12k_base *ab) case M3_DUMP_REGION_TYPE: case PAGEABLE_MEM_REGION_TYPE: case CALDB_MEM_REGION_TYPE: + /* Firmware reloads in recovery/resume. + * In such cases, no need to allocate memory for FW again. + */ + if (chunk->v.addr) { + if (chunk->prev_type == chunk->type && + chunk->prev_size == chunk->size) + goto this_chunk_done; + + /* cannot reuse the existing chunk */ + dma_free_coherent(ab->dev, chunk->prev_size, + chunk->v.addr, chunk->paddr); + chunk->v.addr = NULL; + } + chunk->v.addr = dma_alloc_coherent(ab->dev, chunk->size, &chunk->paddr, @@ -2402,6 +2417,10 @@ static int ath12k_qmi_alloc_target_mem_chunk(struct ath12k_base *ab) chunk->type, chunk->size); return -ENOMEM; } + + chunk->prev_type = chunk->type; + chunk->prev_size = chunk->size; +this_chunk_done: break; default: ath12k_warn(ab, "memory type %u not supported\n", @@ -2707,10 +2726,6 @@ static int ath12k_qmi_m3_load(struct ath12k_base *ab) size_t m3_len; int ret; - if (m3_mem->vaddr) - /* m3 firmware buffer is already available in the DMA buffer */ - return 0; - if (ab->fw.m3_data && ab->fw.m3_len > 0) { /* firmware-N.bin had a m3 firmware file so use that */ m3_data = ab->fw.m3_data; @@ -2732,6 +2747,9 @@ static int ath12k_qmi_m3_load(struct ath12k_base *ab) m3_len = fw->size; } + if (m3_mem->vaddr) + goto skip_m3_alloc; + m3_mem->vaddr = dma_alloc_coherent(ab->dev, m3_len, &m3_mem->paddr, GFP_KERNEL); @@ -2742,6 +2760,7 @@ static int ath12k_qmi_m3_load(struct ath12k_base *ab) goto out; } +skip_m3_alloc: memcpy(m3_mem->vaddr, m3_data, m3_len); m3_mem->size = m3_len; diff --git a/drivers/net/wireless/ath/ath12k/qmi.h b/drivers/net/wireless/ath/ath12k/qmi.h index f7a5eb11ce44..0dfcbd8cb59b 100644 --- a/drivers/net/wireless/ath/ath12k/qmi.h +++ b/drivers/net/wireless/ath/ath12k/qmi.h @@ -96,6 +96,8 @@ struct ath12k_qmi_event_msg { struct target_mem_chunk { u32 size; u32 type; + u32 prev_size; + u32 prev_type; dma_addr_t paddr; union { void __iomem *ioaddr;
Currently during resume, QMI target memory is not properly handled, resulting in kernel crash in case DMA remap is not supported: BUG: Bad page state in process kworker/u16:54 pfn:36e80 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x36e80 page dumped because: nonzero _refcount Call Trace: bad_page free_page_is_bad_report __free_pages_ok __free_pages dma_direct_free dma_free_attrs ath12k_qmi_free_target_mem_chunk ath12k_qmi_msg_mem_request_cb The reason is: Once ath12k module is loaded, firmware sends memory request to host. In case DMA remap not supported, ath12k refuses the first request due to failure in allocating with large segment size: ath12k_pci 0000:04:00.0: qmi firmware request memory request ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 7077888 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 8454144 ath12k_pci 0000:04:00.0: qmi dma allocation failed (7077888 B type 1), will try later with small size ath12k_pci 0000:04:00.0: qmi delays mem_request 2 ath12k_pci 0000:04:00.0: qmi firmware request memory request Later firmware comes back with more but small segments and allocation succeeds: ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 262144 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 524288 ath12k_pci 0000:04:00.0: qmi mem seg type 4 size 65536 ath12k_pci 0000:04:00.0: qmi mem seg type 1 size 524288 Now ath12k is working. If suspend is triggered, firmware will be reloaded during resume. As same as before, firmware requests two large segments at first. In ath12k_qmi_msg_mem_request_cb() segment count and size are assigned: ab->qmi.mem_seg_count == 2 ab->qmi.target_mem[0].size == 7077888 ab->qmi.target_mem[1].size == 8454144 Then allocation failed like before and ath12k_qmi_free_target_mem_chunk() is called to free all allocated segments. Note the first segment is skipped because its v.addr is cleared due to allocation failure: chunk->v.addr = dma_alloc_coherent() Also note that this leaks that segment because it has not been freed. While freeing the second segment, a size of 8454144 is passed to dma_free_coherent(). However remember that this segment is allocated at the first time firmware is loaded, before suspend. So its real size is 524288, much smaller than 8454144. As a result kernel found we are freeing some memory which is in use and thus crashed. So one possible fix would be to free those segments during suspend. This works because with them freed, ath12k_qmi_free_target_mem_chunk() does nothing: all segment addresses are NULL so dma_free_coherent() is not called. But note that ath11k has similar logic but never hits this issue. Reviewing code there shows the luck comes from QMI memory reuse logic. So the decision is to port it to ath12k. Like in ath11k, the crash is avoided by adding prev_size to target_mem_chunk structure and caching real segment size in it, then prev_size instead of current size is passed to dma_free_coherent(), no unexpected memory is freed now. Also reuse m3 buffer. Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0-03427-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.15378.4 Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3 Fixes: 64430ddfb132 ("wifi: ath12k: support suspend/resume") Signed-off-by: Baochen Qiang <quic_bqiang@quicinc.com> --- drivers/net/wireless/ath/ath12k/core.c | 1 - drivers/net/wireless/ath/ath12k/qmi.c | 29 +++++++++++++++++++++----- drivers/net/wireless/ath/ath12k/qmi.h | 2 ++ 3 files changed, 26 insertions(+), 6 deletions(-) base-commit: f5f3a3166c64d469150958a470f4a3ab99d45268