Message ID | 20231213210644.8702-1-robdclark@gmail.com |
---|---|
State | New |
Headers | show |
Series | soc: qcom: pmic_glink: Fix boot when QRTR=m | expand |
On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > We need to bail out before adding/removing devices, if we are going > to -EPROBE_DEFER. Otherwise boot will get stuck forever at > deferred_probe_initcall(). Can please you expand on why this is a problem here in the commit message? The aux devices appear to be tore down correctly in the probe error paths so how exactly does that lead to deferred_probe_initcall() being stuck? This sounds like we may have a problem elsewhere which this patch is papering over. Also where does the probe deferral come from in your case? pdr_handle_alloc()? If this is a correct fix, I'd also expect there to be a Fixes and CC-stable tag. Johan
On Thu, 14 Dec 2023 at 09:16, Johan Hovold <johan@kernel.org> wrote: > > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote: > > From: Rob Clark <robdclark@chromium.org> > > > > We need to bail out before adding/removing devices, if we are going > > to -EPROBE_DEFER. Otherwise boot will get stuck forever at > > deferred_probe_initcall(). > > Can please you expand on why this is a problem here in the commit > message? > > The aux devices appear to be tore down correctly in the probe error > paths so how exactly does that lead to deferred_probe_initcall() being > stuck? This sounds like we may have a problem elsewhere which this patch > is papering over. This is a known problem. Successful probes during the probe deferral loop causes the whole loop to be reiterated. Creating child devices usually results in a successful probe. Aso I thought that just creating new device also causes a reprobe, but I can not find any evidence now. > > Also where does the probe deferral come from in your case? > pdr_handle_alloc()? > > If this is a correct fix, I'd also expect there to be a Fixes and > CC-stable tag. > > Johan >
On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote: > On Thu, 14 Dec 2023 at 09:16, Johan Hovold <johan@kernel.org> wrote: > > > > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote: > > > From: Rob Clark <robdclark@chromium.org> > > > > > > We need to bail out before adding/removing devices, if we are going > > > to -EPROBE_DEFER. Otherwise boot will get stuck forever at > > > deferred_probe_initcall(). > > > > Can please you expand on why this is a problem here in the commit > > message? > > > > The aux devices appear to be tore down correctly in the probe error > > paths so how exactly does that lead to deferred_probe_initcall() being > > stuck? This sounds like we may have a problem elsewhere which this patch > > is papering over. > > This is a known problem. Successful probes during the probe deferral > loop causes the whole loop to be reiterated. Creating child devices > usually results in a successful probe. Aso I thought that just > creating new device also causes a reprobe, but I can not find any > evidence now. This still needs to be described in the commit message. Only a successful probe should trigger a reprobe, and when the child devices are registered the parent is not yet on the deferred probe list. So something is not right or missing here. Johan
On Thu, Dec 14, 2023 at 04:04:49PM +0200, Dmitry Baryshkov wrote: > On Thu, 14 Dec 2023 at 16:01, Johan Hovold <johan@kernel.org> wrote: > > On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote: > > > On Thu, 14 Dec 2023 at 09:16, Johan Hovold <johan@kernel.org> wrote: > > > > On Wed, Dec 13, 2023 at 01:06:43PM -0800, Rob Clark wrote: > > > > > From: Rob Clark <robdclark@chromium.org> > > > > > > > > > > We need to bail out before adding/removing devices, if we are going > > > > > to -EPROBE_DEFER. Otherwise boot will get stuck forever at > > > > > deferred_probe_initcall(). > > > > > > > > Can please you expand on why this is a problem here in the commit > > > > message? > > > > > > > > The aux devices appear to be tore down correctly in the probe error > > > > paths so how exactly does that lead to deferred_probe_initcall() being > > > > stuck? This sounds like we may have a problem elsewhere which this patch > > > > is papering over. > > > > > > This is a known problem. Successful probes during the probe deferral > > > loop causes the whole loop to be reiterated. Creating child devices > > > usually results in a successful probe. Aso I thought that just > > > creating new device also causes a reprobe, but I can not find any > > > evidence now. > > > > This still needs to be described in the commit message. > > > > Only a successful probe should trigger a reprobe, and when the child > > devices are registered the parent is not yet on the deferred probe list. > > So something is not right or missing here. > > Child devices can be successfully probed, then the parent gets > -EPROBE_DEFER, removes children and then it goes on and on. So what? As I described above, the successful probe of the children should have nothing to do with whether the parent is reprobed. If that isn't the case, then explain how. Johan
On Thu, Dec 14, 2023 at 03:09:36PM +0100, Johan Hovold wrote: > On Thu, Dec 14, 2023 at 04:04:49PM +0200, Dmitry Baryshkov wrote: > > On Thu, 14 Dec 2023 at 16:01, Johan Hovold <johan@kernel.org> wrote: > > > On Thu, Dec 14, 2023 at 01:04:43PM +0200, Dmitry Baryshkov wrote: > > > > This is a known problem. Successful probes during the probe deferral > > > > loop causes the whole loop to be reiterated. Creating child devices > > > > usually results in a successful probe. Aso I thought that just > > > > creating new device also causes a reprobe, but I can not find any > > > > evidence now. > > > > > > This still needs to be described in the commit message. > > > > > > Only a successful probe should trigger a reprobe, and when the child > > > devices are registered the parent is not yet on the deferred probe list. > > > So something is not right or missing here. > > > > Child devices can be successfully probed, then the parent gets > > -EPROBE_DEFER, removes children and then it goes on and on. > > So what? As I described above, the successful probe of the children > should have nothing to do with whether the parent is reprobed. > > If that isn't the case, then explain how. I took a closer look at this and indeed we do have code that triggers a reprobe of a device in case there was a successful probe while the device was probing. This was introduced by commit 58b116bce136 ("drivercore: deferral race condition fix") and the workaround for the reprobe-loop bug that hack led to is to not return -EPROBE_DEFER after registering child devices as no one managed to come up with a proper fix. This was documented here: fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER") But please spell this out in some more detail in the commit message, and add a Fixes and CC stable tag. Johan
On Thu, Dec 14, 2023 at 8:08 AM Johan Hovold <johan@kernel.org> wrote: > > On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote: > > > I took a closer look at this and indeed we do have code that triggers a > > reprobe of a device in case there was a successful probe while the > > device was probing. > > > > This was introduced by commit 58b116bce136 ("drivercore: deferral race > > condition fix") and the workaround for the reprobe-loop bug that hack > > led to is to not return -EPROBE_DEFER after registering child devices as > > no one managed to come up with a proper fix. This was documented here: > > > > fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER") > > > > But please spell this out in some more detail in the commit message, and > > add a Fixes and CC stable tag. > > And please update the commit summary as I've been booting with QRTR=m > all along just fine. I guess the issue is if you have pmic_glink > built-in or in the initramfs but forgot to include qrtr or similar? I do have both QRTR=m and QCOM_GLINK=m. I'm honestly not sure what started triggering this issue for me.. it seemed to have started after merging msm-next + drm-misc-next on top of your jhovold/wip/sc8280xp-v6.7-rc5 (the merged branches were based on -rc3 so this shouldn't have really brought in random non-drm things). Maybe there is a timing element to it? I felt like the problem was obvious enough, and the exact details of why I started hitting this were not important enough to spend time tracking down. BR, -R > Johan
On Thu, Dec 14, 2023 at 12:44:10PM -0800, Rob Clark wrote: > On Thu, Dec 14, 2023 at 8:08 AM Johan Hovold <johan@kernel.org> wrote: > > > > On Thu, Dec 14, 2023 at 04:38:35PM +0100, Johan Hovold wrote: > > > > > I took a closer look at this and indeed we do have code that triggers a > > > reprobe of a device in case there was a successful probe while the > > > device was probing. > > > > > > This was introduced by commit 58b116bce136 ("drivercore: deferral race > > > condition fix") and the workaround for the reprobe-loop bug that hack > > > led to is to not return -EPROBE_DEFER after registering child devices as > > > no one managed to come up with a proper fix. This was documented here: > > > > > > fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER") > > > > > > But please spell this out in some more detail in the commit message, and > > > add a Fixes and CC stable tag. > > > > And please update the commit summary as I've been booting with QRTR=m > > all along just fine. I guess the issue is if you have pmic_glink > > built-in or in the initramfs but forgot to include qrtr or similar? > > I do have both QRTR=m and QCOM_GLINK=m. I'm honestly not sure what > started triggering this issue for me.. it seemed to have started after > merging msm-next + drm-misc-next on top of your > jhovold/wip/sc8280xp-v6.7-rc5 (the merged branches were based on -rc3 > so this shouldn't have really brought in random non-drm things). > Maybe there is a timing element to it? Yeah, possibly, and device links may also come into play here. > I felt like the problem was obvious enough, and the exact details of > why I started hitting this were not important enough to spend time > tracking down. The patch itself is of course fine itself as a clean up (or microoptimisation) but the claim that it solves, and is the correct fix for, a probe loop issue is not obvious at all and requires some further justification. Since he last time someone suggested reverting the commit that introduced the probe-loop issue, the result was to leave things as they were and just document the workaround, it should be fine to just refer to that commit: fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER") Perhaps you can keep the Subject if you make it clear in the commit message that this bug isn't always hit with QRTR=m. Johan
diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c index 914057331afd..2899746af6a6 100644 --- a/drivers/soc/qcom/pmic_glink.c +++ b/drivers/soc/qcom/pmic_glink.c @@ -268,10 +268,16 @@ static int pmic_glink_probe(struct platform_device *pdev) else pg->client_mask = PMIC_GLINK_CLIENT_DEFAULT; + pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg); + if (IS_ERR(pg->pdr)) { + ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n"); + return ret; + } + if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) { ret = pmic_glink_add_aux_device(pg, &pg->ucsi_aux, "ucsi"); if (ret) - return ret; + goto out_release_pdr_handle; } if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_ALTMODE)) { ret = pmic_glink_add_aux_device(pg, &pg->altmode_aux, "altmode"); @@ -284,17 +290,11 @@ static int pmic_glink_probe(struct platform_device *pdev) goto out_release_altmode_aux; } - pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg); - if (IS_ERR(pg->pdr)) { - ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n"); - goto out_release_aux_devices; - } - service = pdr_add_lookup(pg->pdr, "tms/servreg", "msm/adsp/charger_pd"); if (IS_ERR(service)) { ret = dev_err_probe(&pdev->dev, PTR_ERR(service), "failed adding pdr lookup for charger_pd\n"); - goto out_release_pdr_handle; + goto out_release_aux_devices; } mutex_lock(&__pmic_glink_lock); @@ -303,8 +303,6 @@ static int pmic_glink_probe(struct platform_device *pdev) return 0; -out_release_pdr_handle: - pdr_handle_release(pg->pdr); out_release_aux_devices: if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_BATT)) pmic_glink_del_aux_device(pg, &pg->ps_aux); @@ -314,6 +312,8 @@ static int pmic_glink_probe(struct platform_device *pdev) out_release_ucsi_aux: if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) pmic_glink_del_aux_device(pg, &pg->ucsi_aux); +out_release_pdr_handle: + pdr_handle_release(pg->pdr); return ret; }