mbox series

[v2,00/23] interconnect: fix racy provider registration

Message ID 20230306075651.2449-1-johan+linaro@kernel.org
Headers show
Series interconnect: fix racy provider registration | expand

Message

Johan Hovold March 6, 2023, 7:56 a.m. UTC
The current interconnect provider interface is inherently racy as
providers are expected to be registered before being fully initialised.

This can specifically cause racing DT lookups to fail as I recently
noticed when the Qualcomm cpufreq driver failed to probe:

	of_icc_xlate_onecell: invalid index 0
        cpu cpu0: error -EINVAL: error finding src node
        cpu cpu0: dev_pm_opp_of_find_icc_paths: Unable to get path0: -22
        qcom-cpufreq-hw: probe of 18591000.cpufreq failed with error -22

This only happens very rarely, but the bug is easily reproduced by
increasing the race window by adding an msleep() after registering
osm-l3 interconnect provider.

Note that the Qualcomm cpufreq driver is especially susceptible to this
race as the interconnect path is looked up from the CPU nodes so that
driver core does not guarantee the probe order even when device links
are enabled (which they not always are).

This series adds a new interconnect provider registration API which is
used to fix up the interconnect drivers before removing the old racy
API.

Included are also a number of fixes for other bugs found while preparing
the series.

Johan


Changes in v2
 - icc_node_destroy() can be called with an arbitrary node id so add the
   missing sanity check to handle potential attempts to destroy a
   non-existing node (patch 01/23). Reported by Dan Carpenter and the
   kernel test robot:

   https://lore.kernel.org/oe-kbuild/202302222118.nGz1F0oJ-lkp@intel.com/


Johan Hovold (23):
  interconnect: fix mem leak when freeing nodes
  interconnect: fix icc_provider_del() error handling
  interconnect: fix provider registration API
  interconnect: imx: fix registration race
  interconnect: qcom: osm-l3: fix registration race
  interconnect: qcom: rpm: fix probe child-node error handling
  interconnect: qcom: rpm: fix probe PM domain error handling
  interconnect: qcom: rpm: fix registration race
  interconnect: qcom: rpmh: fix probe child-node error handling
  interconnect: qcom: rpmh: fix registration race
  interconnect: qcom: msm8974: fix registration race
  interconnect: qcom: sm8450: fix registration race
  interconnect: qcom: sm8550: fix registration race
  interconnect: exynos: fix node leak in probe PM QoS error path
  interconnect: exynos: fix registration race
  interconnect: exynos: drop redundant link destroy
  memory: tegra: fix interconnect registration race
  memory: tegra124-emc: fix interconnect registration race
  memory: tegra20-emc: fix interconnect registration race
  memory: tegra30-emc: fix interconnect registration race
  interconnect: drop racy registration API
  interconnect: drop unused icc_get() interface
  interconnect: drop unused icc_link_destroy() interface

 drivers/interconnect/core.c           | 152 +++++---------------------
 drivers/interconnect/imx/imx.c        |  20 ++--
 drivers/interconnect/qcom/icc-rpm.c   |  33 +++---
 drivers/interconnect/qcom/icc-rpmh.c  |  30 +++--
 drivers/interconnect/qcom/msm8974.c   |  20 ++--
 drivers/interconnect/qcom/osm-l3.c    |  14 +--
 drivers/interconnect/qcom/sm8450.c    |  22 ++--
 drivers/interconnect/qcom/sm8550.c    |  22 ++--
 drivers/interconnect/samsung/exynos.c |  30 ++---
 drivers/memory/tegra/mc.c             |  16 ++-
 drivers/memory/tegra/tegra124-emc.c   |  12 +-
 drivers/memory/tegra/tegra20-emc.c    |  12 +-
 drivers/memory/tegra/tegra30-emc.c    |  12 +-
 include/linux/interconnect-provider.h |  19 ++--
 include/linux/interconnect.h          |   8 --
 15 files changed, 157 insertions(+), 265 deletions(-)

Comments

Abel Vesa March 6, 2023, 9:03 a.m. UTC | #1
On 23-03-06 08:56:41, Johan Hovold wrote:
> The current interconnect provider registration interface is inherently
> racy as nodes are not added until the after adding the provider. This
> can specifically cause racing DT lookups to fail.
> 
> Switch to using the new API where the provider is not registered until
> after it has been fully initialised.
> 
> Fixes: e6f0d6a30f73 ("interconnect: qcom: Add SM8550 interconnect provider driver")
> Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---

Any changes since v1 or is it just a resend? 

>  drivers/interconnect/qcom/sm8550.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/interconnect/qcom/sm8550.c b/drivers/interconnect/qcom/sm8550.c
> index 54fa027ab961..7ab492ca8fe0 100644
> --- a/drivers/interconnect/qcom/sm8550.c
> +++ b/drivers/interconnect/qcom/sm8550.c
> @@ -2197,9 +2197,10 @@ static int qnoc_probe(struct platform_device *pdev)
>  	provider->pre_aggregate = qcom_icc_pre_aggregate;
>  	provider->aggregate = qcom_icc_aggregate;
>  	provider->xlate_extended = qcom_icc_xlate_extended;
> -	INIT_LIST_HEAD(&provider->nodes);
>  	provider->data = data;
>  
> +	icc_provider_init(provider);
> +
>  	qp->dev = &pdev->dev;
>  	qp->bcms = desc->bcms;
>  	qp->num_bcms = desc->num_bcms;
> @@ -2208,12 +2209,6 @@ static int qnoc_probe(struct platform_device *pdev)
>  	if (IS_ERR(qp->voter))
>  		return PTR_ERR(qp->voter);
>  
> -	ret = icc_provider_add(provider);
> -	if (ret) {
> -		dev_err_probe(&pdev->dev, ret,
> -			      "error adding interconnect provider\n");
> -		return ret;
> -	}
>  
>  	for (i = 0; i < qp->num_bcms; i++)
>  		qcom_icc_bcm_init(qp->bcms[i], &pdev->dev);
> @@ -2227,7 +2222,7 @@ static int qnoc_probe(struct platform_device *pdev)
>  		node = icc_node_create(qnodes[i]->id);
>  		if (IS_ERR(node)) {
>  			ret = PTR_ERR(node);
> -			goto err;
> +			goto err_remove_nodes;
>  		}
>  
>  		node->name = qnodes[i]->name;
> @@ -2241,12 +2236,17 @@ static int qnoc_probe(struct platform_device *pdev)
>  	}
>  	data->num_nodes = num_nodes;
>  
> +	ret = icc_provider_register(provider);
> +	if (ret)
> +		goto err_remove_nodes;
> +
>  	platform_set_drvdata(pdev, qp);
>  
>  	return 0;
> -err:
> +
> +err_remove_nodes:
>  	icc_nodes_remove(provider);
> -	icc_provider_del(provider);
> +
>  	return ret;
>  }
>  
> @@ -2254,8 +2254,8 @@ static int qnoc_remove(struct platform_device *pdev)
>  {
>  	struct qcom_icc_provider *qp = platform_get_drvdata(pdev);
>  
> +	icc_provider_deregister(&qp->provider);
>  	icc_nodes_remove(&qp->provider);
> -	icc_provider_del(&qp->provider);
>  
>  	return 0;
>  }
> -- 
> 2.39.2
>
Johan Hovold March 6, 2023, 9:34 a.m. UTC | #2
On Mon, Mar 06, 2023 at 11:03:14AM +0200, Abel Vesa wrote:
> On 23-03-06 08:56:41, Johan Hovold wrote:
> > The current interconnect provider registration interface is inherently
> > racy as nodes are not added until the after adding the provider. This
> > can specifically cause racing DT lookups to fail.
> > 
> > Switch to using the new API where the provider is not registered until
> > after it has been fully initialised.
> > 
> > Fixes: e6f0d6a30f73 ("interconnect: qcom: Add SM8550 interconnect provider driver")
> > Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
> > Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > ---
> 
> Any changes since v1 or is it just a resend? 

Please see the cover letter:

	https://lore.kernel.org/lkml/20230306075651.2449-1-johan+linaro@kernel.org/

Only the first patch of the series was updated in v2.

Johan
Johan Hovold March 6, 2023, 10:04 a.m. UTC | #3
On Mon, Mar 06, 2023 at 11:39:33AM +0200, Abel Vesa wrote:
> On 23-03-06 10:34:34, Johan Hovold wrote:
> > On Mon, Mar 06, 2023 at 11:03:14AM +0200, Abel Vesa wrote:
> > > On 23-03-06 08:56:41, Johan Hovold wrote:
> > > > The current interconnect provider registration interface is inherently
> > > > racy as nodes are not added until the after adding the provider. This
> > > > can specifically cause racing DT lookups to fail.
> > > > 
> > > > Switch to using the new API where the provider is not registered until
> > > > after it has been fully initialised.
> > > > 
> > > > Fixes: e6f0d6a30f73 ("interconnect: qcom: Add SM8550 interconnect provider driver")
> > > > Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
> > > > Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> > > > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> > > > ---
> > > 
> > > Any changes since v1 or is it just a resend? 
> > 
> > Please see the cover letter:
> > 
> > 	https://lore.kernel.org/lkml/20230306075651.2449-1-johan+linaro@kernel.org/
> > 
> > Only the first patch of the series was updated in v2.
> 
> Right, my bad. Though I wasn't CC'ed on that as well.

Yeah, sorry about that. I copied the CC list from v1 and forgot to make
sure that all reviewers were copied on the cover letter.

Johan
Luca Ceresoli March 7, 2023, 9:42 a.m. UTC | #4
Hello Johan,

thanks for respinning.

On Mon,  6 Mar 2023 08:56:29 +0100
Johan Hovold <johan+linaro@kernel.org> wrote:

> The node link array is allocated when adding links to a node but is not
> deallocated when nodes are destroyed.
> 
> Fixes: 11f1ceca7031 ("interconnect: Add generic on-chip interconnect API")
> Cc: stable@vger.kernel.org      # 5.1
> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>

[Tested on i.MX8MP using an MSC SM2-MB-EP1 Board]
Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
Luca Ceresoli March 7, 2023, 9:43 a.m. UTC | #5
On Mon,  6 Mar 2023 08:56:30 +0100
Johan Hovold <johan+linaro@kernel.org> wrote:

> The interconnect framework currently expects that providers are only
> removed when there are no users and after all nodes have been removed.
> 
> There is currently nothing that guarantees this to be the case and the
> framework does not do any reference counting, but refusing to remove the
> provider is never correct as that would leave a dangling pointer to a
> resource that is about to be released in the global provider list (e.g.
> accessible through debugfs).
> 
> Replace the current sanity checks with WARN_ON() so that the provider is
> always removed.
> 
> Fixes: 11f1ceca7031 ("interconnect: Add generic on-chip interconnect API")
> Cc: stable@vger.kernel.org      # 5.1: 680f8666baf6: interconnect: Make icc_provider_del() return void
> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>

[Tested on i.MX8MP using an MSC SM2-MB-EP1 Board]
Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
Luca Ceresoli March 7, 2023, 9:43 a.m. UTC | #6
On Mon,  6 Mar 2023 08:56:31 +0100
Johan Hovold <johan+linaro@kernel.org> wrote:

> The current interconnect provider interface is inherently racy as
> providers are expected to be added before being fully initialised.
> 
> Specifically, nodes are currently not added and the provider data is not
> initialised until after registering the provider which can cause racing
> DT lookups to fail.
> 
> Add a new provider API which will be used to fix up the interconnect
> drivers.
> 
> The old API is reimplemented using the new interface and will be removed
> once all drivers have been fixed.
> 
> Fixes: 11f1ceca7031 ("interconnect: Add generic on-chip interconnect API")
> Fixes: 87e3031b6fbd ("interconnect: Allow endpoints translation via DT")
> Cc: stable@vger.kernel.org      # 5.1
> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>

[Tested on i.MX8MP using an MSC SM2-MB-EP1 Board]
Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com>