mbox series

[v2,0/6] mmc: handle undervoltage events and prevent eMMC corruption

Message ID 20250220074429.2906141-1-o.rempel@pengutronix.de
Headers show
Series mmc: handle undervoltage events and prevent eMMC corruption | expand

Message

Oleksij Rempel Feb. 20, 2025, 7:44 a.m. UTC
This patch set introduces a framework for handling undervoltage events
in the MMC subsystem. The goal is to improve system reliability by
ensuring graceful handling of power fluctuations that could otherwise
lead to metadata corruption, potentially rendering the eMMC chip
unusable or causing significant data loss.

## Problem Statement

Power fluctuations and sudden losses can leave eMMC devices in an
undefined state, leading to severe consequences. The worst case can
result in metadata corruption, making the entire storage inaccessible.
While some eMMC devices promise to handle such situations internally,
experience shows that some chip variants are still affected. This has
led vendors to take a more protective approach, implementing external
undervoltage handling as a precautionary measure to avoid costly field
failures and returns.

The existence of the "Power Off Notification" feature in the eMMC
standard itself serves as indirect evidence that this is a real-world
issue.  While some projects have already faced the consequences of
ignoring this problem (often at significant cost), specific cases cannot
be disclosed due to NDAs.

## Challenges and Implementation Approach

1. **Raising awareness of the problem**: While vendors have used
   proprietary solutions for years, a unified approach is needed upstream.
   This patch set is a first step in making that happen.

2. **Finding an acceptable implementation path**: There are multiple
   ways to handle undervoltage - either in the kernel or in user space,
   through a global shutdown mechanism, or using the regulator framework.
   This patch set takes the kernel-based approach but does not prevent
   future extensions, such as allowing user-space handoff once available.

3. **Preparing for vendor adoption and testing**: By providing a
   structured solution upstream, this patch set lowers the barrier for
   vendors to standardize their undervoltage handling instead of relying on
   fragmented, out-of-tree implementations.

## Current Limitations

This patch set is an initial step and does not yet cover all possible
design restrictions or edge cases. Future improvements may include
better coordination with user space and enhancements based on broader
testing.

## Testing Details

The implementation was tested on an iMX8MP-based system. The board had
approximately 100ms of available power hold-up time. The Power Off
Notification was sent ~4ms after the board was detached from the power
supply, allowing sufficient time for the eMMC to handle the event
properly.  Tests were conducted under both idle conditions and active
read/write operations.

Oleksij Rempel (6):
  mmc: core: Handle undervoltage events and register regulator notifiers
  mmc: core: make mmc_interrupt_hpi() global
  mmc: core: refactor _mmc_suspend() for undervoltage handling
  mmc: core: add undervoltage handler for MMC/eMMC devices
  mmc: block: abort requests and suppress errors after undervoltage
    shutdown
  mmc: sdhci: prevent command execution after undervoltage shutdown

 drivers/mmc/core/block.c     |   2 +-
 drivers/mmc/core/core.c      |  20 ++++++
 drivers/mmc/core/core.h      |   2 +
 drivers/mmc/core/mmc.c       | 101 ++++++++++++++++++++++++------
 drivers/mmc/core/mmc_ops.c   |   2 +-
 drivers/mmc/core/mmc_ops.h   |   1 +
 drivers/mmc/core/queue.c     |   2 +-
 drivers/mmc/core/regulator.c | 115 +++++++++++++++++++++++++++++++++++
 drivers/mmc/host/sdhci.c     |   9 +++
 include/linux/mmc/host.h     |   9 +++
 10 files changed, 241 insertions(+), 22 deletions(-)

--
2.39.5

Comments

Oleksij Rempel Feb. 20, 2025, 10:56 a.m. UTC | #1
On Thu, Feb 20, 2025 at 10:47:01AM +0000, Christian Loehle wrote:
> On 2/20/25 07:44, Oleksij Rempel wrote:
> > Introduce _mmc_handle_undervoltage() to handle undervoltage events for
> > MMC/eMMC devices. This function interrupts ongoing operations using HPI
> > and performs a controlled suspend. After completing the sequence, the card
> > is marked as removed to prevent further interactions.
> > 
> > To support this, introduce __mmc_suspend() and __mmc_resume() as internal
> > helpers that omit mmc_claim_host()/mmc_release_host(), allowing them to be
> > called when the host is already claimed. Update mmc_shutdown() to skip the
> > normal shutdown sequence if the host is flagged as undervoltage to avoid
> > repeating of the shutdown procedure.
> 
> "of" can be removed here.
> 
> Given that this introduces large parts of the mmc handling IMO this commit
> deserves a lot more explanation of what steps exactly do for which cards
> and why.

ack

> > 
> > Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
> > ---
> >  drivers/mmc/core/mmc.c | 81 +++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 68 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
> > index 86b608843232..e626213e7a52 100644
> > --- a/drivers/mmc/core/mmc.c
> > +++ b/drivers/mmc/core/mmc.c
> > @@ -2104,8 +2104,8 @@ static int _mmc_flush_cache(struct mmc_host *host)
> >  	return err;
> >  }
> >  
> > -static int _mmc_suspend(struct mmc_host *host, bool is_suspend,
> > -			bool is_undervoltage)
> > +static int __mmc_suspend(struct mmc_host *host, bool is_suspend,
> > +			 bool is_undervoltage)
> 
> The is_undervoltage doesn't do anything? Did you forget something here?

This was done in the previous patch "mmc: core: refactor _mmc_suspend()
for undervoltage handling"

> > @@ -2189,6 +2205,9 @@ static int mmc_shutdown(struct mmc_host *host)
> >  {
> >  	int err = 0;
> >  
> > +	if (host->undervoltage)
> > +		return 0;
> > +
> 
> Probably deserves a comment.

ack

> >  	/*
> >  	 * In a specific case for poweroff notify, we need to resume the card
> >  	 * before we can shutdown it properly.
> > @@ -2280,6 +2299,41 @@ static int _mmc_hw_reset(struct mmc_host *host)
> >  	return mmc_init_card(host, card->ocr, card);
> >  }
> >  
> > +static int _mmc_handle_undervoltage(struct mmc_host *host)
> > +{
> > +	struct mmc_card *card = host->card;
> > +	int err = 0;
> > +
> > +	mmc_claim_host(host);
> > +
> > +	if (!host->card)
> > +		goto out;
> > +
> > +	if (mmc_can_poweroff_notify(host->card) &&
> > +		!(host->caps2 & MMC_CAP2_FULL_PWR_CYCLE))
> > +		err = __mmc_resume(host);
> 
> I'm not sure I follow this.
> Why would we power-up a card that currently doesn't have power when we
> know we are about to powerfail it?

It is part of the mmc_shutdown, but it is not used on my HW. So, can be
skip it.

> > +
> > +	if (!err) {
> > +		err = mmc_interrupt_hpi(card);
> > +		if (err)
> > +			pr_err("%s: Interrupt HPI failed, error %d\n",
> > +				mmc_hostname(host), err);
> 
> There's no point in calling this for SD but I don't see why it currently
> wouldn't be called for SD.

I tried to keep budget low, until we agree that it is the way to go.
After this patch stack is accepted, i can try to request more time to
add and test the SD handler.
Christian Loehle Feb. 20, 2025, 11:22 a.m. UTC | #2
On 2/20/25 10:56, Oleksij Rempel wrote:
> On Thu, Feb 20, 2025 at 10:47:01AM +0000, Christian Loehle wrote:
>> On 2/20/25 07:44, Oleksij Rempel wrote:
>>> Introduce _mmc_handle_undervoltage() to handle undervoltage events for
>>> MMC/eMMC devices. This function interrupts ongoing operations using HPI
>>> and performs a controlled suspend. After completing the sequence, the card
>>> is marked as removed to prevent further interactions.
>>>
>>> To support this, introduce __mmc_suspend() and __mmc_resume() as internal
>>> helpers that omit mmc_claim_host()/mmc_release_host(), allowing them to be
>>> called when the host is already claimed. Update mmc_shutdown() to skip the
>>> normal shutdown sequence if the host is flagged as undervoltage to avoid
>>> repeating of the shutdown procedure.
>>
>> "of" can be removed here.
>>
>> Given that this introduces large parts of the mmc handling IMO this commit
>> deserves a lot more explanation of what steps exactly do for which cards
>> and why.
> 
> ack
> 
>>>
>>> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
>>> ---
>>>  drivers/mmc/core/mmc.c | 81 +++++++++++++++++++++++++++++++++++-------
>>>  1 file changed, 68 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
>>> index 86b608843232..e626213e7a52 100644
>>> --- a/drivers/mmc/core/mmc.c
>>> +++ b/drivers/mmc/core/mmc.c
>>> @@ -2104,8 +2104,8 @@ static int _mmc_flush_cache(struct mmc_host *host)
>>>  	return err;
>>>  }
>>>  
>>> -static int _mmc_suspend(struct mmc_host *host, bool is_suspend,
>>> -			bool is_undervoltage)
>>> +static int __mmc_suspend(struct mmc_host *host, bool is_suspend,
>>> +			 bool is_undervoltage)
>>
>> The is_undervoltage doesn't do anything? Did you forget something here?
> 
> This was done in the previous patch "mmc: core: refactor _mmc_suspend()
> for undervoltage handling"

Sorry!

> 
>>> @@ -2189,6 +2205,9 @@ static int mmc_shutdown(struct mmc_host *host)
>>>  {
>>>  	int err = 0;
>>>  
>>> +	if (host->undervoltage)
>>> +		return 0;
>>> +
>>
>> Probably deserves a comment.
> 
> ack
> 
>>>  	/*
>>>  	 * In a specific case for poweroff notify, we need to resume the card
>>>  	 * before we can shutdown it properly.
>>> @@ -2280,6 +2299,41 @@ static int _mmc_hw_reset(struct mmc_host *host)
>>>  	return mmc_init_card(host, card->ocr, card);
>>>  }
>>>  
>>> +static int _mmc_handle_undervoltage(struct mmc_host *host)
>>> +{
>>> +	struct mmc_card *card = host->card;
>>> +	int err = 0;
>>> +
>>> +	mmc_claim_host(host);
>>> +
>>> +	if (!host->card)
>>> +		goto out;
>>> +
>>> +	if (mmc_can_poweroff_notify(host->card) &&
>>> +		!(host->caps2 & MMC_CAP2_FULL_PWR_CYCLE))
>>> +		err = __mmc_resume(host);
>>
>> I'm not sure I follow this.
>> Why would we power-up a card that currently doesn't have power when we
>> know we are about to powerfail it?
> 
> It is part of the mmc_shutdown, but it is not used on my HW. So, can be
> skip it.
> 
>>> +
>>> +	if (!err) {
>>> +		err = mmc_interrupt_hpi(card);
>>> +		if (err)
>>> +			pr_err("%s: Interrupt HPI failed, error %d\n",
>>> +				mmc_hostname(host), err);
>>
>> There's no point in calling this for SD but I don't see why it currently
>> wouldn't be called for SD.
> 
> I tried to keep budget low, until we agree that it is the way to go.
> After this patch stack is accepted, i can try to request more time to
> add and test the SD handler.
If you're not implementing this for now, why not just drop the undervoltage
event at patch 1/6 if host doesn't have an eMMC attached?