Message ID: 20231016095610.1095084-1-korneld@chromium.org
State: New
Series: mmc: cqhci: Be more verbose in error irq handler
On 16/10/23 12:56, Kornel Dulęba wrote:
> There are several reasons for controller to generate an error interrupt.
> They include controller<->card timeout, and CRC mismatch error.
> Right now we only get one line in the logs stating that CQE recovery was
> triggered, but with no information about what caused it.
> To figure out what happened be more verbose and dump the registers from
> irq error handler logic.
> This matches the behaviour of the software timeout logic, see
> cqhci_timeout.
>
> Signed-off-by: Kornel Dulęba <korneld@chromium.org>
> ---
>  drivers/mmc/host/cqhci-core.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..33abb4bd53b5 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>
>  	terri = cqhci_readl(cq_host, CQHCI_TERRI);
>
> -	pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> -		 mmc_hostname(mmc), status, cmd_error, data_error, terri);
> +	pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> +		mmc_hostname(mmc), status, cmd_error, data_error);
> +	cqhci_dumpregs(cq_host);

For debugging, isn't dynamic debug more appropriate?

>
>  	/* Forget about errors when recovery has already been triggered */
>  	if (cq_host->recovery_halt)
On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 16/10/23 12:56, Kornel Dulęba wrote:
> > [patch quoted in full, snipped]
>
> For debugging, isn't dynamic debug more appropriate?

Dynamic debug is an option, but my personal preference would be to just
log more info in the error handler.

To give you some background.
We're seeing some "running CQE recovery" lines in the logs, followed by
a dm_verity mismatch error.
The reports come from the field, with no feasible way to reproduce the
issue locally.
I'd argue that logging only the info that CQE recovery was executed is
not particularly helpful for someone looking into those logs.
Ideally we would have more data about the state the controller was in
when the error happened, or at least what caused the recovery to be
triggered.
The question here is how verbose should we be in this error scenario.
Looking at other error scenarios, in the case of a software timeout
we're dumping the controller registers (cqhci_timeout).
Hence I thought that it'd be appropriate to match that and do the same
in the CQE recovery logic.
On 20/10/23 11:53, Kornel Dulęba wrote:
> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> [...]
>> For debugging, isn't dynamic debug more appropriate?
>
> Dynamic debug is an option, but my personal preference would be to
> just log more info in the error handler.

Interrupt handlers can get called very rapidly, so some kind of rate
limiting should be used if the message is unconditional. Also you need
to provide actual reasons for your preference.

For dynamic debug of the register dump, something like below is
possible.

#define cqhci_dynamic_dumpregs(cqhost) \
	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)

> To give you some background.
> We're seeing some "running CQE recovery" lines in the logs, followed
> by a dm_verity mismatch error.
> The reports come from the field, with no feasible way to reproduce the
> issue locally.

If it is a software error, some kind of error injection may well
reproduce it. Also if it is a hardware error that only happens
during recovery, error injection could increase the likelihood of
reproducing it.

> I'd argue that logging only the info that CQE recovery was executed is
> not particularly helpful for someone looking into those logs.

As the comment says, that message is there because recovery reduces
performance, it is not to aid debugging per se.

> Ideally we would have more data about the state the controller was in
> when the error happened, or at least what caused the recovery to be
> triggered.
> The question here is how verbose should we be in this error scenario.
> Looking at other error scenarios, in the case of a software timeout
> we're dumping the controller registers (cqhci_timeout).

Timeout means something is broken - either the driver, the cq engine
or the card. On the other hand, an error interrupt is most likely a
CRC error which is not unexpected occasionally, due to thermal drift
or perhaps interference.

> Hence I thought that it'd be appropriate to match that and do the same
> in the CQE recovery logic.

It needs to be consistent. There are other pr_debugs, such as:

	pr_debug("%s: cqhci: Failed to clear tasks\n",
	pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
	pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));

which should perhaps be treated the same.

And there are no messages for errors from the commands in
mmc_cqe_recovery().
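[Editor's note: if the register dump were gated behind a dynamic debug call like the macro Adrian sketches above, it could be switched on at runtime through the standard dynamic debug control file (see the kernel's admin-guide/dynamic-debug-howto). The exact match expression below is an assumption; it depends on where the hypothetical cqhci_dynamic_dumpregs() ends up being called from.]

```
# enable dynamic debug output for the CQHCI error interrupt path at runtime
echo 'func cqhci_error_irq +p' > /sys/kernel/debug/dynamic_debug/control

# or enable all dynamic debug sites in the cqhci core
echo 'file drivers/mmc/host/cqhci-core.c +p' > /sys/kernel/debug/dynamic_debug/control
```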
On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> [...]
> Interrupt handlers can get called very rapidly, so some kind of rate
> limiting should be used if the message is unconditional. Also you need
> to provide actual reasons for your preference.
>
> For dynamic debug of the register dump, something like below is
> possible.
>
> #define cqhci_dynamic_dumpregs(cqhost) \
> 	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>
Fair point.
The reason I'm not a fan of using dynamic debug for this is that my
goal here is to improve the warning/error logging information that we
get from systems running in production.
I.e. if we get a lot of "running CQE recovery" messages, at the very
least I'd like to know what is causing them.

> > To give you some background.
> > We're seeing some "running CQE recovery" lines in the logs, followed
> > by a dm_verity mismatch error.
> > The reports come from the field, with no feasible way to reproduce the
> > issue locally.
>
> If it is a software error, some kind of error injection may well
> reproduce it. Also if it is a hardware error that only happens
> during recovery, error injection could increase the likelihood of
> reproducing it.

We tried software injection and it didn't yield any results.
We're currently looking into "injecting" hw errors by using a small
burst field generator to interfere with transfers on the data line
directly.

> > I'd argue that logging only the info that CQE recovery was executed is
> > not particularly helpful for someone looking into those logs.
>
> As the comment says, that message is there because recovery reduces
> performance, it is not to aid debugging per se.
>
> [...]
> Timeout means something is broken - either the driver, the cq engine
> or the card. On the other hand, an error interrupt is most likely a
> CRC error which is not unexpected occasionally, due to thermal drift
> or perhaps interference.

Right, but my point is that we don't know what triggered CQE recovery.

> It needs to be consistent. There are other pr_debugs, such as:
>
> 	pr_debug("%s: cqhci: Failed to clear tasks\n",
> 	pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
> 	pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
>
> which should perhaps be treated the same.
>
> And there are no messages for errors from the commands in
> mmc_cqe_recovery().

How about this.
As a compromise, would it be okay to just do a single pr_warn directly
from cqhci_error_irq?
We could simply promote the existing pr_debug to pr_warn at the
beginning of that function.
This would tell us what triggered the recovery (controller timeout,
CRC mismatch).
We can also consider removing the "running CQE recovery" print for the
sake of brevity.
The only downside of this that I can see is that we'd be running the
logic from the interrupt handler directly, but I can't see an easy way
around that.
What do you think?
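[Editor's note: a sketch of the compromise Kornel describes, folding in Adrian's rate limiting concern. This is an illustration only, not the actual v2 patch; pr_warn_ratelimited() is the stock printk helper for this, though an explicit per-host __ratelimit() state would also work.]

```c
/* Sketch only: promote the existing pr_debug in cqhci_error_irq() to a
 * rate-limited warning, so production logs show what triggered recovery
 * (controller timeout vs CRC error) without flooding on an IRQ storm. */
pr_warn_ratelimited("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
		    mmc_hostname(mmc), status, cmd_error, data_error, terri);
```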
On 25/10/23 11:01, Kornel Dulęba wrote:
> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote:
>> [...]
>> Interrupt handlers can get called very rapidly, so some kind of rate
>> limiting should be used if the message is unconditional. Also you need
>> to provide actual reasons for your preference.
>>
>> For dynamic debug of the register dump, something like below is
>> possible.
>>
>> #define cqhci_dynamic_dumpregs(cqhost) \
>> 	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>>
> Fair point.
> The reason I'm not a fan of using dynamic debug for this is that my
> goal here is to improve the warning/error logging information that we
> get from systems running in production.
> I.e. if we get a lot of "running CQE recovery" messages, at the very
> least I'd like to know what is causing them.

So you are saying you want to collect debug information from production
systems, but don't want to use dynamic debug to do it?

> We tried software injection and it didn't yield any results.
> We're currently looking into "injecting" hw errors by using a small
> burst field generator to interfere with transfers on the data line
> directly.

I just tried instrumenting a driver to inject CRC errors and it
revealed several CQE recovery issues, including spurious TCN for
tag 31. I will send some patches when they are ready.

> [...]
> Right, but my point is that we don't know what triggered CQE recovery.

True, although probably a CRC error.

> How about this.
> As a compromise, would it be okay to just do a single pr_warn directly
> from cqhci_error_irq?

Sure, printk_ratelimited() or __ratelimit()

> We could simply promote the existing pr_debug to pr_warn at the
> beginning of that function.
> This would tell us what triggered the recovery (controller timeout,
> CRC mismatch).
> We can also consider removing the "running CQE recovery" print for the
> sake of brevity.

No, that serves a different purpose.

> The only downside of this that I can see is that we'd be running the
> logic from the interrupt handler directly, but I can't see an easy way
> around that.
> What do you think?

Should be OK with rate limiting.
On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> [...]
> > We tried software injection and it didn't yield any results.
> > We're currently looking into "injecting" hw errors by using a small
> > burst field generator to interfere with transfers on the data line
> > directly.
>
> I just tried instrumenting a driver to inject CRC errors and it
> revealed several CQE recovery issues, including spurious TCN for
> tag 31. I will send some patches when they are ready.

Sorry, what I meant by "it didn't yield results" is that it didn't
trigger the dm-verity error that we're seeing on production.
With SW injection there are two potential issues that come to my mind:

1. In cqhci_error_irq, when TERRI is not valid, only a single,
"random" task is marked as bad.
Then in cqhci_recover_mrq we're marking all pending requests as done.
For data transfers this is somewhat benign, as it will return with
bytes_xfered=0.
IIUC this will then cause the upper layer to re-enqueue this request.
The bigger problem is a CMD-only mrq, which will be mistakenly marked
as completed successfully.

2. As for the spurious task completion warning, I initially thought
that it was benign.
The check for !mrq is done before checking if we're currently doing
recovery.
So if it's called right at the end of recovery, right after
cqhci_recover_mrqs is executed, that would explain it.
With that being said, if that irq handler is run right after the
recovery is finished, we'll end up with a race where a new request,
that was just enqueued, might be mistakenly marked as done.
This would explain the dm-verity errors we're seeing.

> [...]
> Should be OK with rate limiting.

OK, I'll look into the rate limiting and will send a v2.
On 26/10/23 10:56, Kornel Dulęba wrote:
> [...]
> Sorry, what I meant by "it didn't yield results" is that it didn't
> trigger the dm-verity error that we're seeing on production.
> With SW injection there are two potential issues that come to my mind:
>
> 1. In cqhci_error_irq, when TERRI is not valid, only a single,
> "random" task is marked as bad.
> Then in cqhci_recover_mrq we're marking all pending requests as done.
> For data transfers this is somewhat benign, as it will return with
> bytes_xfered=0.
> IIUC this will then cause the upper layer to re-enqueue this request.

Yes

> The bigger problem is a CMD-only mrq, which will be mistakenly marked
> as completed successfully.

I noticed that also. Notably the only non-data CMD is cache flush.
There are several other issues, but patches will describe them better.

> 2. As for the spurious task completion warning, I initially thought
> that it was benign.
> [...]
> This would explain the dm-verity errors we're seeing.
On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > > On 26/10/23 10:56, Kornel Dulęba wrote: > > On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > >> > >> On 25/10/23 11:01, Kornel Dulęba wrote: > >>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote: > >>>> > >>>> On 20/10/23 11:53, Kornel Dulęba wrote: > >>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > >>>>>> > >>>>>> On 16/10/23 12:56, Kornel Dulęba wrote: > >>>>>>> There are several reasons for controller to generate an error interrupt. > >>>>>>> They include controller<->card timeout, and CRC mismatch error. > >>>>>>> Right now we only get one line in the logs stating that CQE recovery was > >>>>>>> triggered, but with no information about what caused it. > >>>>>>> To figure out what happened be more verbose and dump the registers from > >>>>>>> irq error handler logic. > >>>>>>> This matches the behaviour of the software timeout logic, see > >>>>>>> cqhci_timeout. 
> >>>>>>> > >>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org> > >>>>>>> --- > >>>>>>> drivers/mmc/host/cqhci-core.c | 5 +++-- > >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) > >>>>>>> > >>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c > >>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644 > >>>>>>> --- a/drivers/mmc/host/cqhci-core.c > >>>>>>> +++ b/drivers/mmc/host/cqhci-core.c > >>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error, > >>>>>>> > >>>>>>> terri = cqhci_readl(cq_host, CQHCI_TERRI); > >>>>>>> > >>>>>>> - pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n", > >>>>>>> - mmc_hostname(mmc), status, cmd_error, data_error, terri); > >>>>>>> + pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n", > >>>>>>> + mmc_hostname(mmc), status, cmd_error, data_error); > >>>>>>> + cqhci_dumpregs(cq_host); > >>>>>> > >>>>>> For debugging, isn't dynamic debug seems more appropriate? > >>>>> > >>>>> Dynamic debug is an option, but my personal preference would be to > >>>>> just log more info in the error handler. > >>>> > >>>> Interrupt handlers can get called very rapidly, so some kind of rate > >>>> limiting should be used if the message is unconditional. Also you need > >>>> to provide actual reasons for your preference. > >>>> > >>>> For dynamic debug of the register dump, something like below is > >>>> possible. > >>>> > >>>> #define cqhci_dynamic_dumpregs(cqhost) \ > >>>> _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost) > >>>> > >>> Fair point. > >>> The reason I'm not a fan of using dynamic debug for this is that my > >>> goal here is to improve the warning/error logging information that we > >>> get from systems running in production. > >>> I.e. if we get a lot of "running CQE recovery" messages, at the very > >>> least I'd like to know what is causing them. 
> >> > >> So you are saying you want to collect debug information from production > >> systems, but don't want to use dynamic debug to do it? > >> > >>>>> To give you some background. > >>>>> We're seeing some "running CQE recovery" lines in the logs, followed > >>>>> by a dm_verity mismatch error. > >>>>> The reports come from the field, with no feasible way to reproduce the > >>>>> issue locally. > >>>> > >>>> If it is a software error, some kind of error injection may well > >>>> reproduce it. Also if it is a hardware error that only happens > >>>> during recovery, error injection could increase the likelihood of > >>>> reproducing it. > >>> > >>> We tried software injection and it didn't yield any results. > >>> We're currently looking into "injecting" hw errors by using a small > >>> burst field generator to interfere with transfers on the data line > >>> directly. > >> > >> I just tried instrumenting a driver to inject CRC errors and it > >> revealed several CQE recovery issues, including spurious TCN for > >> tag 31. I will send some patches when they are ready. > > > > Sorry, what I meant by it didn't yield results is that Ii didn't > > trigger the dm-verity error that we're seeing on production. > > With SW injection there are two potential issues that come to my mind: > > > > 1. In the cqhci_error_irq when TERRI is not valid only a single, > > "random" task is marked as bad. > > Then in cqhci_recover_mrq we're marking all pending requests as done. > > For data transfers this is somewhat bening as it will return with > > bytes_xfered=0. > > IIUC this will then cause the upper layer to re-enqueue this request. > > Yes > > > The bigger problem is a CMD only mrq, which will be mistakenly marked > > as completed successfully. > > I noticed that also. Notably the only non-data CMD is cache flush. > > There are several other issues, but patches will describe > them better. Sure. :) jfyi, I've just managed to inject CRC errors by using a burst field generator. 
(Langer P23 if you're interested.)
I'm using it by touching the D0 MMC line directly, and it yields
surprisingly good results.
I've changed the spurious TCN WARN_ONCE to pr_warn and got the following:

[   71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x972e0000
[   71.885730] mmc1: running CQE recovery
[   71.888135] cqhci_recovery_finish: TCN: 0x00000000
[   71.888141] mmc1: cqhci: recovery done
[   71.888223] mmc1: cqhci: spurious TCN for tag 23
(...)
[   95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x822e0000
[   95.558768] mmc1: running CQE recovery
[   95.561073] cqhci_recovery_finish: TCN: 0x00000000
[   95.561078] mmc1: cqhci: recovery done
[   95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted

Now I get a spurious TCN after every recovery, with the only exception
being the one that ends up with the dm-verity error.
So it'd seem that there's a race in which the "spurious" TCN hits a
pending request, enqueued right after recovery was completed.
I'm currently looking into how to fix it, but if you beat me to it I
can also test your patches and see if they fix the dm-verity issue.

>
> >
> > 2. As for the spurious task completion warning.
> > I initially thought that it was benign.
> > The check for !mrq is done before checking if we're currently doing recovery.
> > So if it's called just right at the end of recovery, right after the
> > cqhci_recover_mrqs is executed that would explain it.
> > With that being said if that irq handler is run right after the
> > recovery is finished we'll end up with a race where a new request,
> > that was just enqueued, might be mistakenly marked as done.
> > This would explain the dm-verity errors we're seeing.
> >
> >>
> >>>>
> >>>>>
> >>>>> I'd argue that logging only the info that CQE recovery was executed is
> >>>>> not particularly helpful for someone looking into those logs.
> >>>> > >>>> As the comment says, that message is there because recovery reduces > >>>> performance, it is not to aid debugging per se. > >>>> > >>>>> Ideally we would have more data about the state the controller was in > >>>>> when the error happened, or at least what caused the recovery to be > >>>>> triggered. > >>>>> The question here is how verbose should we be in this error scenario. > >>>>> Looking at other error scenarios, in the case of a software timeout > >>>>> we're dumping the controller registers. (cqhci_timeout) > >>>> > >>>> Timeout means something is broken - either the driver, the cq engine > >>>> or the card. On the other hand, an error interrupt is most likely a > >>>> CRC error which is not unexpected occasionally, due to thermal drift > >>>> or perhaps interference. > >>> > >>> Right, but my point is that we don't know what triggered CQE recovery. > >> > >> True, although probably a CRC error. > >> > >>> > >>>> > >>>>> Hence I thought that I'd be appropriate to match that and do the same > >>>>> in CQE recovery logic. > >>>> > >>>> It needs to be consistent. There are other pr_debugs, such as: > >>>> > >>>> pr_debug("%s: cqhci: Failed to clear tasks\n", > >>>> pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc)); > >>>> pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc)); > >>>> > >>>> which should perhaps be treated the same. > >>>> > >>>> And there are no messages for errors from the commands in > >>>> mmc_cqe_recovery(). > >>> > >>> How about this. > >>> As a compromise would it be okay to just do a single pr_warn directly > >>> from cqhci_error_irq. > >> > >> Sure, printk_ratelimited() or __ratelimit() > >> > >>> We could simply promote the existing pr_debug to pr_warn at the > >>> beginning of that function. > >>> This would tell us what triggered the recovery. (controller timeout, > >>> CRC mismatch) > >>> We can also consider removing the "running CQE recovery" print for the > >>> sake of brevity. 
> >> > >> No, that serves a different purpose. > >> > >>> The only downside of this that I can see is that we'd be running the > >>> logic from the interrupt handler directly, but I can't see an easy way > >>> around that. > >>> What do you think? > >> > >> Should be OK with rate limiting. > > > > OK, I'll look into the rate limiting and will send a v2. > > > >> > >>>> > >>>>> > >>>>>> > >>>>>>> > >>>>>>> /* Forget about errors when recovery has already been triggered */ > >>>>>>> if (cq_host->recovery_halt) > >>>>>> > >>>> > >> >
On 26/10/23 11:29, Kornel Dulęba wrote: > On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote: >> >> On 26/10/23 10:56, Kornel Dulęba wrote: >>> On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote: >>>> >>>> On 25/10/23 11:01, Kornel Dulęba wrote: >>>>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote: >>>>>> >>>>>> On 20/10/23 11:53, Kornel Dulęba wrote: >>>>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote: >>>>>>>> >>>>>>>> On 16/10/23 12:56, Kornel Dulęba wrote: >>>>>>>>> There are several reasons for controller to generate an error interrupt. >>>>>>>>> They include controller<->card timeout, and CRC mismatch error. >>>>>>>>> Right now we only get one line in the logs stating that CQE recovery was >>>>>>>>> triggered, but with no information about what caused it. >>>>>>>>> To figure out what happened be more verbose and dump the registers from >>>>>>>>> irq error handler logic. >>>>>>>>> This matches the behaviour of the software timeout logic, see >>>>>>>>> cqhci_timeout. 
>>>>>>>>> >>>>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org> >>>>>>>>> --- >>>>>>>>> drivers/mmc/host/cqhci-core.c | 5 +++-- >>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c >>>>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644 >>>>>>>>> --- a/drivers/mmc/host/cqhci-core.c >>>>>>>>> +++ b/drivers/mmc/host/cqhci-core.c >>>>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error, >>>>>>>>> >>>>>>>>> terri = cqhci_readl(cq_host, CQHCI_TERRI); >>>>>>>>> >>>>>>>>> - pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n", >>>>>>>>> - mmc_hostname(mmc), status, cmd_error, data_error, terri); >>>>>>>>> + pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n", >>>>>>>>> + mmc_hostname(mmc), status, cmd_error, data_error); >>>>>>>>> + cqhci_dumpregs(cq_host); >>>>>>>> >>>>>>>> For debugging, isn't dynamic debug seems more appropriate? >>>>>>> >>>>>>> Dynamic debug is an option, but my personal preference would be to >>>>>>> just log more info in the error handler. >>>>>> >>>>>> Interrupt handlers can get called very rapidly, so some kind of rate >>>>>> limiting should be used if the message is unconditional. Also you need >>>>>> to provide actual reasons for your preference. >>>>>> >>>>>> For dynamic debug of the register dump, something like below is >>>>>> possible. >>>>>> >>>>>> #define cqhci_dynamic_dumpregs(cqhost) \ >>>>>> _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost) >>>>>> >>>>> Fair point. >>>>> The reason I'm not a fan of using dynamic debug for this is that my >>>>> goal here is to improve the warning/error logging information that we >>>>> get from systems running in production. >>>>> I.e. if we get a lot of "running CQE recovery" messages, at the very >>>>> least I'd like to know what is causing them. 
>>>> >>>> So you are saying you want to collect debug information from production >>>> systems, but don't want to use dynamic debug to do it? >>>> >>>>>>> To give you some background. >>>>>>> We're seeing some "running CQE recovery" lines in the logs, followed >>>>>>> by a dm_verity mismatch error. >>>>>>> The reports come from the field, with no feasible way to reproduce the >>>>>>> issue locally. >>>>>> >>>>>> If it is a software error, some kind of error injection may well >>>>>> reproduce it. Also if it is a hardware error that only happens >>>>>> during recovery, error injection could increase the likelihood of >>>>>> reproducing it. >>>>> >>>>> We tried software injection and it didn't yield any results. >>>>> We're currently looking into "injecting" hw errors by using a small >>>>> burst field generator to interfere with transfers on the data line >>>>> directly. >>>> >>>> I just tried instrumenting a driver to inject CRC errors and it >>>> revealed several CQE recovery issues, including spurious TCN for >>>> tag 31. I will send some patches when they are ready. >>> >>> Sorry, what I meant by it didn't yield results is that Ii didn't >>> trigger the dm-verity error that we're seeing on production. >>> With SW injection there are two potential issues that come to my mind: >>> >>> 1. In the cqhci_error_irq when TERRI is not valid only a single, >>> "random" task is marked as bad. >>> Then in cqhci_recover_mrq we're marking all pending requests as done. >>> For data transfers this is somewhat bening as it will return with >>> bytes_xfered=0. >>> IIUC this will then cause the upper layer to re-enqueue this request. >> >> Yes >> >>> The bigger problem is a CMD only mrq, which will be mistakenly marked >>> as completed successfully. >> >> I noticed that also. Notably the only non-data CMD is cache flush. >> >> There are several other issues, but patches will describe >> them better. > > Sure. 
:) > > jfyi, I've just managed to inject CRC errors by using a burst field generator. > (Langer P23 if you're interested.) > I'm using it by touching the D0 MMC line directly, and it yields > surprisingly good results. > I've changed the spurious TCN WARN_ONCE to pr_warn and got the following: > > [ 71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 > data error -84 TERRI: 0x972e0000 > [ 71.885730] mmc1: running CQE recovery > [ 71.888135] cqhci_recovery_finish: TCN: 0x00000000 > [ 71.888141] mmc1: cqhci: recovery done > [ 71.888223] mmc1: cqhci: spurious TCN for tag 23 > (...) > [ 95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 > data error -84 TERRI: 0x822e0000 > [ 95.558768] mmc1: running CQE recovery > [ 95.561073] cqhci_recovery_finish: TCN: 0x00000000 > [ 95.561078] mmc1: cqhci: recovery done > [ 95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted > > Now I get a spurious TCN after every recovery, with the only exception > being the one that ends up with dm-verity error. > So it'd seem that there's a race in which the "spurious" TCN hits a > pending request, enqueued right after recovery was completed. > I'm currently looking into how to fix it, but if you beat me to it I > can also test your patches and see if it fixes the dm-verity issue. 
OK, but here are some hacks to try:

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 3a8f27c3e310..5eb98da933d5 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2379,6 +2379,9 @@ enum mmc_issued mmc_blk_mq_issue_rq(struct mmc_queue *mq, struct request *req)
 			blk_mq_end_request(req, BLK_STS_OK);
 			return MMC_REQ_FINISHED;
 		}
+		ret = mmc_blk_wait_for_idle(mq, host);
+		if (ret)
+			return MMC_REQ_BUSY;
 		ret = mmc_blk_cqe_issue_flush(mq, req);
 		break;
 	case REQ_OP_READ:
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 3d3e0ca52614..8ac1efca2be6 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -563,6 +563,16 @@ int mmc_cqe_recovery(struct mmc_host *host)
 
 	host->cqe_ops->cqe_recovery_finish(host);
 
+	if (err) {
+		memset(&cmd, 0, sizeof(cmd));
+		cmd.opcode = MMC_CMDQ_TASK_MGMT;
+		cmd.arg = 1; /* Discard entire queue */
+		cmd.flags = MMC_RSP_R1B | MMC_CMD_AC;
+		cmd.flags &= ~MMC_RSP_CRC; /* Ignore CRC */
+		cmd.busy_timeout = MMC_CQE_RECOVERY_TIMEOUT;
+		err = mmc_wait_for_cmd(host, &cmd, 3);
+	}
+
 	mmc_retune_release(host);
 
 	return err;
diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index b3d7d6d8d654..93e1bbdc446f 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -975,8 +975,13 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
 
 	ret = cqhci_halted(cq_host);
 
-	if (!ret)
-		pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
+	if (!ret) {
+		u32 cqcfg;
+
+		cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
+		cqcfg &= ~CQHCI_ENABLE;
+		cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
+	}
 
 	return ret;
 }
@@ -987,7 +992,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
  * layers will need to send a STOP command), so we set the timeout based on a
  * generous command timeout.
  */
-#define CQHCI_START_HALT_TIMEOUT	5
+#define CQHCI_START_HALT_TIMEOUT	200
 
 static void cqhci_recovery_start(struct mmc_host *mmc)
 {
@@ -1039,6 +1044,8 @@ static void cqhci_recover_mrq(struct cqhci_host *cq_host, unsigned int tag)
 			data->error = cqhci_error_from_flags(slot->flags);
 	} else {
 		mrq->cmd->error = cqhci_error_from_flags(slot->flags);
+		if (!mrq->cmd->error)
+			mrq->cmd->error = -EIO;
 	}
 
 	mmc_cqe_request_done(cq_host->mmc, mrq);

>
>>
>>>
>>> 2. As for the spurious task completion warning.
>>> I initially thought that it was benign.
>>> The check for !mrq is done before checking if we're currently doing recovery.
>>> So if it's called just right at the end of recovery, right after the
>>> cqhci_recover_mrqs is executed that would explain it.
>>> With that being said if that irq handler is run right after the
>>> recovery is finished we'll end up with a race where a new request,
>>> that was just enqueued, might be mistakenly marked as done.
>>> This would explain the dm-verity errors we're seeing.
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> I'd argue that logging only the info that CQE recovery was executed is
>>>>>>> not particularly helpful for someone looking into those logs.
>>>>>>
>>>>>> As the comment says, that message is there because recovery reduces
>>>>>> performance, it is not to aid debugging per se.
>>>>>>
>>>>>>> Ideally we would have more data about the state the controller was in
>>>>>>> when the error happened, or at least what caused the recovery to be
>>>>>>> triggered.
>>>>>>> The question here is how verbose should we be in this error scenario.
>>>>>>> Looking at other error scenarios, in the case of a software timeout
>>>>>>> we're dumping the controller registers. (cqhci_timeout)
>>>>>>
>>>>>> Timeout means something is broken - either the driver, the cq engine
>>>>>> or the card. On the other hand, an error interrupt is most likely a
>>>>>> CRC error which is not unexpected occasionally, due to thermal drift
>>>>>> or perhaps interference.
>>>>> >>>>> Right, but my point is that we don't know what triggered CQE recovery. >>>> >>>> True, although probably a CRC error. >>>> >>>>> >>>>>> >>>>>>> Hence I thought that I'd be appropriate to match that and do the same >>>>>>> in CQE recovery logic. >>>>>> >>>>>> It needs to be consistent. There are other pr_debugs, such as: >>>>>> >>>>>> pr_debug("%s: cqhci: Failed to clear tasks\n", >>>>>> pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc)); >>>>>> pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc)); >>>>>> >>>>>> which should perhaps be treated the same. >>>>>> >>>>>> And there are no messages for errors from the commands in >>>>>> mmc_cqe_recovery(). >>>>> >>>>> How about this. >>>>> As a compromise would it be okay to just do a single pr_warn directly >>>>> from cqhci_error_irq. >>>> >>>> Sure, printk_ratelimited() or __ratelimit() >>>> >>>>> We could simply promote the existing pr_debug to pr_warn at the >>>>> beginning of that function. >>>>> This would tell us what triggered the recovery. (controller timeout, >>>>> CRC mismatch) >>>>> We can also consider removing the "running CQE recovery" print for the >>>>> sake of brevity. >>>> >>>> No, that serves a different purpose. >>>> >>>>> The only downside of this that I can see is that we'd be running the >>>>> logic from the interrupt handler directly, but I can't see an easy way >>>>> around that. >>>>> What do you think? >>>> >>>> Should be OK with rate limiting. >>> >>> OK, I'll look into the rate limiting and will send a v2. >>> >>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> /* Forget about errors when recovery has already been triggered */ >>>>>>>>> if (cq_host->recovery_halt) >>>>>>>> >>>>>> >>>> >>
On Thu, Oct 26, 2023 at 10:52 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > > On 26/10/23 11:29, Kornel Dulęba wrote: > > On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > >> > >> On 26/10/23 10:56, Kornel Dulęba wrote: > >>> On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > >>>> > >>>> On 25/10/23 11:01, Kornel Dulęba wrote: > >>>>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@intel.com> wrote: > >>>>>> > >>>>>> On 20/10/23 11:53, Kornel Dulęba wrote: > >>>>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@intel.com> wrote: > >>>>>>>> > >>>>>>>> On 16/10/23 12:56, Kornel Dulęba wrote: > >>>>>>>>> There are several reasons for controller to generate an error interrupt. > >>>>>>>>> They include controller<->card timeout, and CRC mismatch error. > >>>>>>>>> Right now we only get one line in the logs stating that CQE recovery was > >>>>>>>>> triggered, but with no information about what caused it. > >>>>>>>>> To figure out what happened be more verbose and dump the registers from > >>>>>>>>> irq error handler logic. > >>>>>>>>> This matches the behaviour of the software timeout logic, see > >>>>>>>>> cqhci_timeout. 
> >>>>>>>>> > >>>>>>>>> Signed-off-by: Kornel Dulęba <korneld@chromium.org> > >>>>>>>>> --- > >>>>>>>>> drivers/mmc/host/cqhci-core.c | 5 +++-- > >>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) > >>>>>>>>> > >>>>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c > >>>>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644 > >>>>>>>>> --- a/drivers/mmc/host/cqhci-core.c > >>>>>>>>> +++ b/drivers/mmc/host/cqhci-core.c > >>>>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error, > >>>>>>>>> > >>>>>>>>> terri = cqhci_readl(cq_host, CQHCI_TERRI); > >>>>>>>>> > >>>>>>>>> - pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n", > >>>>>>>>> - mmc_hostname(mmc), status, cmd_error, data_error, terri); > >>>>>>>>> + pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n", > >>>>>>>>> + mmc_hostname(mmc), status, cmd_error, data_error); > >>>>>>>>> + cqhci_dumpregs(cq_host); > >>>>>>>> > >>>>>>>> For debugging, isn't dynamic debug seems more appropriate? > >>>>>>> > >>>>>>> Dynamic debug is an option, but my personal preference would be to > >>>>>>> just log more info in the error handler. > >>>>>> > >>>>>> Interrupt handlers can get called very rapidly, so some kind of rate > >>>>>> limiting should be used if the message is unconditional. Also you need > >>>>>> to provide actual reasons for your preference. > >>>>>> > >>>>>> For dynamic debug of the register dump, something like below is > >>>>>> possible. > >>>>>> > >>>>>> #define cqhci_dynamic_dumpregs(cqhost) \ > >>>>>> _dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost) > >>>>>> > >>>>> Fair point. > >>>>> The reason I'm not a fan of using dynamic debug for this is that my > >>>>> goal here is to improve the warning/error logging information that we > >>>>> get from systems running in production. > >>>>> I.e. 
if we get a lot of "running CQE recovery" messages, at the very > >>>>> least I'd like to know what is causing them. > >>>> > >>>> So you are saying you want to collect debug information from production > >>>> systems, but don't want to use dynamic debug to do it? > >>>> > >>>>>>> To give you some background. > >>>>>>> We're seeing some "running CQE recovery" lines in the logs, followed > >>>>>>> by a dm_verity mismatch error. > >>>>>>> The reports come from the field, with no feasible way to reproduce the > >>>>>>> issue locally. > >>>>>> > >>>>>> If it is a software error, some kind of error injection may well > >>>>>> reproduce it. Also if it is a hardware error that only happens > >>>>>> during recovery, error injection could increase the likelihood of > >>>>>> reproducing it. > >>>>> > >>>>> We tried software injection and it didn't yield any results. > >>>>> We're currently looking into "injecting" hw errors by using a small > >>>>> burst field generator to interfere with transfers on the data line > >>>>> directly. > >>>> > >>>> I just tried instrumenting a driver to inject CRC errors and it > >>>> revealed several CQE recovery issues, including spurious TCN for > >>>> tag 31. I will send some patches when they are ready. > >>> > >>> Sorry, what I meant by it didn't yield results is that Ii didn't > >>> trigger the dm-verity error that we're seeing on production. > >>> With SW injection there are two potential issues that come to my mind: > >>> > >>> 1. In the cqhci_error_irq when TERRI is not valid only a single, > >>> "random" task is marked as bad. > >>> Then in cqhci_recover_mrq we're marking all pending requests as done. > >>> For data transfers this is somewhat bening as it will return with > >>> bytes_xfered=0. > >>> IIUC this will then cause the upper layer to re-enqueue this request. > >> > >> Yes > >> > >>> The bigger problem is a CMD only mrq, which will be mistakenly marked > >>> as completed successfully. > >> > >> I noticed that also. 
Notably the only non-data CMD is cache flush.
> >>
> >> There are several other issues, but patches will describe
> >> them better.
> >
> > Sure. :)
> >
> > jfyi, I've just managed to inject CRC errors by using a burst field generator.
> > (Langer P23 if you're interested.)
> > I'm using it by touching the D0 MMC line directly, and it yields
> > surprisingly good results.
> > I've changed the spurious TCN WARN_ONCE to pr_warn and got the following:
> >
> > [ 71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0
> > data error -84 TERRI: 0x972e0000
> > [ 71.885730] mmc1: running CQE recovery
> > [ 71.888135] cqhci_recovery_finish: TCN: 0x00000000
> > [ 71.888141] mmc1: cqhci: recovery done
> > [ 71.888223] mmc1: cqhci: spurious TCN for tag 23
> > (...)
> > [ 95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0
> > data error -84 TERRI: 0x822e0000
> > [ 95.558768] mmc1: running CQE recovery
> > [ 95.561073] cqhci_recovery_finish: TCN: 0x00000000
> > [ 95.561078] mmc1: cqhci: recovery done
> > [ 95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted
> >
> > Now I get a spurious TCN after every recovery, with the only exception
> > being the one that ends up with the dm-verity error.
> > So it'd seem that there's a race in which the "spurious" TCN hits a
> > pending request, enqueued right after recovery was completed.
> > I'm currently looking into how to fix it, but if you beat me to it I
> > can also test your patches and see if it fixes the dm-verity issue.
>
> OK, but here are some hacks to try:

(snip)

Thanks, fyi, with them applied I can see the exact same behaviour,
including what I'm seeing in the logs.

>
>
> >
> >>
> >>>
> >>> 2. As for the spurious task completion warning.
> >>> I initially thought that it was benign.
> >>> The check for !mrq is done before checking if we're currently doing recovery.
> >>> So if it's called just right at the end of recovery, right after the
> >>> cqhci_recover_mrqs is executed that would explain it.
> >>> With that being said, if that irq handler is run right after the
> >>> recovery is finished we'll end up with a race where a new request,
> >>> that was just enqueued, might be mistakenly marked as done.
> >>> This would explain the dm-verity errors we're seeing.
> >>>
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I'd argue that logging only the info that CQE recovery was executed is
> >>>>>>> not particularly helpful for someone looking into those logs.
> >>>>>>
> >>>>>> As the comment says, that message is there because recovery reduces
> >>>>>> performance, it is not to aid debugging per se.
> >>>>>>
> >>>>>>> Ideally we would have more data about the state the controller was in
> >>>>>>> when the error happened, or at least what caused the recovery to be
> >>>>>>> triggered.
> >>>>>>> The question here is how verbose should we be in this error scenario.
> >>>>>>> Looking at other error scenarios, in the case of a software timeout
> >>>>>>> we're dumping the controller registers. (cqhci_timeout)
> >>>>>>
> >>>>>> Timeout means something is broken - either the driver, the cq engine
> >>>>>> or the card. On the other hand, an error interrupt is most likely a
> >>>>>> CRC error which is not unexpected occasionally, due to thermal drift
> >>>>>> or perhaps interference.
> >>>>>
> >>>>> Right, but my point is that we don't know what triggered CQE recovery.
> >>>>
> >>>> True, although probably a CRC error.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> Hence I thought that it'd be appropriate to match that and do the same
> >>>>>>> in CQE recovery logic.
> >>>>>>
> >>>>>> It needs to be consistent.
> >>>>>> There are other pr_debugs, such as:
> >>>>>>
> >>>>>> pr_debug("%s: cqhci: Failed to clear tasks\n",
> >>>>>> pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
> >>>>>> pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> >>>>>>
> >>>>>> which should perhaps be treated the same.
> >>>>>>
> >>>>>> And there are no messages for errors from the commands in
> >>>>>> mmc_cqe_recovery().
> >>>>>
> >>>>> How about this.
> >>>>> As a compromise would it be okay to just do a single pr_warn directly
> >>>>> from cqhci_error_irq.
> >>>>
> >>>> Sure, printk_ratelimited() or __ratelimit()
> >>>>
> >>>>> We could simply promote the existing pr_debug to pr_warn at the
> >>>>> beginning of that function.
> >>>>> This would tell us what triggered the recovery. (controller timeout,
> >>>>> CRC mismatch)
> >>>>> We can also consider removing the "running CQE recovery" print for the
> >>>>> sake of brevity.
> >>>>
> >>>> No, that serves a different purpose.
> >>>>
> >>>>> The only downside of this that I can see is that we'd be running the
> >>>>> logic from the interrupt handler directly, but I can't see an easy way
> >>>>> around that.
> >>>>> What do you think?
> >>>>
> >>>> Should be OK with rate limiting.
> >>>
> >>> OK, I'll look into the rate limiting and will send a v2.
> >>>
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> /* Forget about errors when recovery has already been triggered */
> >>>>>>>>> if (cq_host->recovery_halt)
> >>>>>>>>
> >>>>>>
> >>>>
> >>
>