[v4,3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

Message ID	20240509084833.2147767-4-zhenzhong.duan@intel.com
State	New
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.13]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A6411149E1A; Thu, 9 May 2024 08:51:47 +0000 (UTC) From: Zhenzhong Duan <zhenzhong.duan@intel.com> To: linux-pci@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org, linux-acpi@vger.kernel.org, rafael@kernel.org, lenb@kernel.org, james.morse@arm.com, tony.luck@intel.com, bp@alien8.de, dave@stgolabs.net, jonathan.cameron@huawei.com, dave.jiang@intel.com, alison.schofield@intel.com, vishal.l.verma@intel.com, ira.weiny@intel.com, bhelgaas@google.com, helgaas@kernel.org, mahesh@linux.ibm.com, oohall@gmail.com, linmiaohe@huawei.com, shiju.jose@huawei.com, adam.c.preble@intel.com, lukas@wunner.de, Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com, linux-cxl@vger.kernel.org, linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org, erwin.tsaur@intel.com, sathyanarayanan.kuppuswamy@intel.com, dan.j.williams@intel.com, feiting.wanyan@intel.com, yudong.wang@intel.com, chao.p.peng@intel.com, qingshun.wang@linux.intel.com, Zhenzhong Duan <zhenzhong.duan@intel.com> Subject: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE Date: Thu, 9 May 2024 16:48:33 +0800 Message-Id: <20240509084833.2147767-4-zhenzhong.duan@intel.com> In-Reply-To: <20240509084833.2147767-1-zhenzhong.duan@intel.com> References: <20240509084833.2147767-1-zhenzhong.duan@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	PCI/AER: Handle Advisory Non-Fatal error \| expand [v4,0/3] PCI/AER: Handle Advisory Non-Fatal error [v4,1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info [v4,2/3] PCI/AER: Print UNCOR_STATUS bits that might be ANFE [v4,3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

Duan, Zhenzhong May 9, 2024, 8:48 a.m. UTC

When processing an ANFE, ideally both correctable error(CE) status and
uncorrectable error(UE) status should be cleared. However, there is no
way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
NFE will bring some issues, i.e., breaking softwore probing; treating
NFE as ANFE will make us ignoring some UEs which need active recover
operation. To avoid clearing UEs that are not ANFE by accident, the
most conservative route is taken here: If any of the NFE Detected bits
is set in Device Status, do not touch UE status, they should be cleared
later by the UE handler. Otherwise, a specific set of UEs that may be
raised as ANFE according to the PCIe specification will be cleared if
their corresponding severity is Non-Fatal.

For instance, previously when kernel receives an ANFE with Poisoned TLP
in OS native AER mode, only status of CE will be reported and cleared:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

If the kernel receives a Malformed TLP after that, two UEs will be
reported, which is unexpected. Malformed TLP Header is lost since
the previous ANFE gated the TLP header logs:

  PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP

Now, for the same scenario, both CE status and related UE status will be
reported and cleared after ANFE:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [18] TLP

Tested-by: Yudong Wang <yudong.wang@intel.com>
Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 drivers/pci/pcie/aer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Jonathan Cameron June 6, 2024, 3:11 p.m. UTC | #1

On Thu,  9 May 2024 16:48:33 +0800
Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:

> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will bring some issues, i.e., breaking softwore probing; treating
> NFE as ANFE will make us ignoring some UEs which need active recover
> operation. To avoid clearing UEs that are not ANFE by accident, the
> most conservative route is taken here: If any of the NFE Detected bits
> is set in Device Status, do not touch UE status, they should be cleared
> later by the UE handler. Otherwise, a specific set of UEs that may be
> raised as ANFE according to the PCIe specification will be cleared if
> their corresponding severity is Non-Fatal.
> 
> For instance, previously when kernel receives an ANFE with Poisoned TLP
> in OS native AER mode, only status of CE will be reported and cleared:
> 
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
> 
> If the kernel receives a Malformed TLP after that, two UEs will be
> reported, which is unexpected. Malformed TLP Header is lost since
> the previous ANFE gated the TLP header logs:
> 
>   PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00041000/00180020
>      [12] TLP                    (First)
>      [18] MalfTLP
> 
> Now, for the same scenario, both CE status and related UE status will be
> reported and cleared after ANFE:
> 
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>     Uncorrectable errors that may cause Advisory Non-Fatal:
>      [18] TLP
> 
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

This is nasty enough though that it would benefit from more review
if possible.  

Thanks for all the detailed explanations in the patch descriptions,
that made it less painful than it might have been.

Jonathan

Kuppuswamy, Sathyanarayanan June 13, 2024, 10:59 p.m. UTC | #2

On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will bring some issues, i.e., breaking softwore probing; treating
/s/softwore/software

May be this is already discussed. But can you explain why treating
AFNE as non-fatal error will bring probing issues?
> NFE as ANFE will make us ignoring some UEs which need active recover
/s/ignoring/ignore
> operation. To avoid clearing UEs that are not ANFE by accident, the
> most conservative route is taken here: If any of the NFE Detected bits
> is set in Device Status, do not touch UE status, they should be cleared
> later by the UE handler. Otherwise, a specific set of UEs that may be
> raised as ANFE according to the PCIe specification will be cleared if
> their corresponding severity is Non-Fatal.
>
> For instance, previously when kernel receives an ANFE with Poisoned TLP
> in OS native AER mode, only status of CE will be reported and cleared:
>
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>
> If the kernel receives a Malformed TLP after that, two UEs will be
> reported, which is unexpected. Malformed TLP Header is lost since
> the previous ANFE gated the TLP header logs:
>
>   PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00041000/00180020
>      [12] TLP                    (First)
>      [18] MalfTLP
>
> Now, for the same scenario, both CE status and related UE status will be
> reported and cleared after ANFE:
>
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>     Uncorrectable errors that may cause Advisory Non-Fatal:
>      [18] TLP
>
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  drivers/pci/pcie/aer.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ed435f09ac27..6a6a3a40569a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  		 * Correctable error does not need software intervention.
>  		 * No need to go through error recovery process.
>  		 */
> -		if (aer)
> +		if (aer) {
>  			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>  					info->status);
> +			if (info->anfe_status)
> +				pci_write_config_dword(dev,
> +						       aer + PCI_ERR_UNCOR_STATUS,
> +						       info->anfe_status);
> +		}

Why split the handling part and storing part into two patches? Why not merge
this part of patch 1/3.

>  		if (pcie_aer_is_native(dev)) {
>  			struct pci_driver *pdrv = dev->driver;
>

Duan, Zhenzhong June 14, 2024, 2:40 a.m. UTC | #3

>-----Original Message-----
>From: Kuppuswamy, Sathyanarayanan
>Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might
>be ANFE
>
>
>On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
>> When processing an ANFE, ideally both correctable error(CE) status and
>> uncorrectable error(UE) status should be cleared. However, there is no
>> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
>> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
>> NFE will bring some issues, i.e., breaking softwore probing; treating
>/s/softwore/software

Good catch, will fix. It's strange 'checkpatch --codespell' doesn't catch this.

>
>May be this is already discussed. But can you explain why treating
>AFNE as non-fatal error will bring probing issues?

Copied below from spec 6.1, 6.2.3.2.4, says it can results in a System Error.

In some cases the detector of a non-fatal error is not the most appropriate agent to determine whether the error is
recoverable or not, or if it even needs any recovery action at all. For example, if software attempts to perform a
configuration read from a non-existent device or Function, the resulting UR Status in the Completion will signal the error
to software, and software does not need for the Completer in addition to signal the error by sending an ERR_NONFATAL
Message. In fact, on some platforms, signaling the error with ERR_NONFATAL results in a System Error, which breaks
normal software probing.

>> NFE as ANFE will make us ignoring some UEs which need active recover
>/s/ignoring/ignore

Will fix.

>> operation. To avoid clearing UEs that are not ANFE by accident, the
>> most conservative route is taken here: If any of the NFE Detected bits
>> is set in Device Status, do not touch UE status, they should be cleared
>> later by the UE handler. Otherwise, a specific set of UEs that may be
>> raised as ANFE according to the PCIe specification will be cleared if
>> their corresponding severity is Non-Fatal.
>>
>> For instance, previously when kernel receives an ANFE with Poisoned TLP
>> in OS native AER mode, only status of CE will be reported and cleared:
>>
>>   AER: Correctable error message received from 0000:b7:02.0
>>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>     device [8086:0db0] error status/mask=00002000/00000000
>>      [13] NonFatalErr
>>
>> If the kernel receives a Malformed TLP after that, two UEs will be
>> reported, which is unexpected. Malformed TLP Header is lost since
>> the previous ANFE gated the TLP header logs:
>>
>>   PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer,
>(Receiver ID)
>>     device [8086:0db0] error status/mask=00041000/00180020
>>      [12] TLP                    (First)
>>      [18] MalfTLP
>>
>> Now, for the same scenario, both CE status and related UE status will be
>> reported and cleared after ANFE:
>>
>>   AER: Correctable error message received from 0000:b7:02.0
>>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>     device [8086:0db0] error status/mask=00002000/00000000
>>      [13] NonFatalErr
>>     Uncorrectable errors that may cause Advisory Non-Fatal:
>>      [18] TLP
>>
>> Tested-by: Yudong Wang <yudong.wang@intel.com>
>> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
>> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  drivers/pci/pcie/aer.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ed435f09ac27..6a6a3a40569a 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct
>pci_dev *dev, struct aer_err_info *info)
>>  		 * Correctable error does not need software intervention.
>>  		 * No need to go through error recovery process.
>>  		 */
>> -		if (aer)
>> +		if (aer) {
>>  			pci_write_config_dword(dev, aer +
>PCI_ERR_COR_STATUS,
>>  					info->status);
>> +			if (info->anfe_status)
>> +				pci_write_config_dword(dev,
>> +						       aer +
>PCI_ERR_UNCOR_STATUS,
>> +						       info->anfe_status);
>> +		}
>
>Why split the handling part and storing part into two patches? Why not
>merge
>this part of patch 1/3.

This is based on Bjorn's suggestion at https://www.spinics.net/lists/linux-pci/msg149012.html,
clearing UNCOR_STATUS might be more important, deserve to raise out.

Thanks
Zhenzhong

Kuppuswamy Sathyanarayanan June 14, 2024, 3:18 a.m. UTC | #4

On 6/13/24 7:40 PM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Kuppuswamy, Sathyanarayanan
>> Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might
>> be ANFE
>>
>>
>> On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
>>> When processing an ANFE, ideally both correctable error(CE) status and
>>> uncorrectable error(UE) status should be cleared. However, there is no
>>> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
>>> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
>>> NFE will bring some issues, i.e., breaking softwore probing; treating
>> /s/softwore/software
> Good catch, will fix. It's strange 'checkpatch --codespell' doesn't catch this.
>
>> May be this is already discussed. But can you explain why treating
>> AFNE as non-fatal error will bring probing issues?
> Copied below from spec 6.1, 6.2.3.2.4, says it can results in a System Error.
>
> In some cases the detector of a non-fatal error is not the most appropriate agent to determine whether the error is
> recoverable or not, or if it even needs any recovery action at all. For example, if software attempts to perform a
> configuration read from a non-existent device or Function, the resulting UR Status in the Completion will signal the error
> to software, and software does not need for the Completer in addition to signal the error by sending an ERR_NONFATAL
> Message. In fact, on some platforms, signaling the error with ERR_NONFATAL results in a System Error, which breaks
> normal software probing.
>
>>> NFE as ANFE will make us ignoring some UEs which need active recover
>> /s/ignoring/ignore
> Will fix.
>
>>> operation. To avoid clearing UEs that are not ANFE by accident, the
>>> most conservative route is taken here: If any of the NFE Detected bits
>>> is set in Device Status, do not touch UE status, they should be cleared
>>> later by the UE handler. Otherwise, a specific set of UEs that may be
>>> raised as ANFE according to the PCIe specification will be cleared if
>>> their corresponding severity is Non-Fatal.
>>>
>>> For instance, previously when kernel receives an ANFE with Poisoned TLP
>>> in OS native AER mode, only status of CE will be reported and cleared:
>>>
>>>   AER: Correctable error message received from 0000:b7:02.0
>>>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>>     device [8086:0db0] error status/mask=00002000/00000000
>>>      [13] NonFatalErr
>>>
>>> If the kernel receives a Malformed TLP after that, two UEs will be
>>> reported, which is unexpected. Malformed TLP Header is lost since
>>> the previous ANFE gated the TLP header logs:
>>>
>>>   PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer,
>> (Receiver ID)
>>>     device [8086:0db0] error status/mask=00041000/00180020
>>>      [12] TLP                    (First)
>>>      [18] MalfTLP
>>>
>>> Now, for the same scenario, both CE status and related UE status will be
>>> reported and cleared after ANFE:
>>>
>>>   AER: Correctable error message received from 0000:b7:02.0
>>>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>>>     device [8086:0db0] error status/mask=00002000/00000000
>>>      [13] NonFatalErr
>>>     Uncorrectable errors that may cause Advisory Non-Fatal:
>>>      [18] TLP
>>>
>>> Tested-by: Yudong Wang <yudong.wang@intel.com>
>>> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
>>> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  drivers/pci/pcie/aer.c | 7 ++++++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index ed435f09ac27..6a6a3a40569a 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct
>> pci_dev *dev, struct aer_err_info *info)
>>>  		 * Correctable error does not need software intervention.
>>>  		 * No need to go through error recovery process.
>>>  		 */
>>> -		if (aer)
>>> +		if (aer) {
>>>  			pci_write_config_dword(dev, aer +
>> PCI_ERR_COR_STATUS,
>>>  					info->status);
>>> +			if (info->anfe_status)
>>> +				pci_write_config_dword(dev,
>>> +						       aer +
>> PCI_ERR_UNCOR_STATUS,
>>> +						       info->anfe_status);
>>> +		}
>> Why split the handling part and storing part into two patches? Why not
>> merge
>> this part of patch 1/3.
> This is based on Bjorn's suggestion at https://www.spinics.net/lists/linux-pci/msg149012.html,
> clearing UNCOR_STATUS might be more important, deserve to raise out.

I think Bjorn's suggestion is to divide it into two logical patches.
One for printing the error and another to clear the UNCOR_STATUS
properly. But currently you have split the UNCOR_STATUS status caching and
clearing process into two patches. IMO, your first patch can store ANFE
status and clear it. You can add print support in the second patch. 


Code wise it looks fine to me. You can add my Reviewed-by after fixing
the typos

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>
> Thanks
> Zhenzhong

Duan, Zhenzhong June 14, 2024, 3:32 a.m. UTC | #5

>-----Original Message-----
>From: Kuppuswamy Sathyanarayanan
><sathyanarayanan.kuppuswamy@linux.intel.com>
>Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might
>be ANFE
>
>
>On 6/13/24 7:40 PM, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Kuppuswamy, Sathyanarayanan
>>> Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that
>might
>>> be ANFE
>>>
>>>
>>> On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
>>>> When processing an ANFE, ideally both correctable error(CE) status and
>>>> uncorrectable error(UE) status should be cleared. However, there is no
>>>> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
>>>> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
>>>> NFE will bring some issues, i.e., breaking softwore probing; treating
>>> /s/softwore/software
>> Good catch, will fix. It's strange 'checkpatch --codespell' doesn't catch this.
>>
>>> May be this is already discussed. But can you explain why treating
>>> AFNE as non-fatal error will bring probing issues?
>> Copied below from spec 6.1, 6.2.3.2.4, says it can results in a System Error.
>>
>> In some cases the detector of a non-fatal error is not the most appropriate
>agent to determine whether the error is
>> recoverable or not, or if it even needs any recovery action at all. For
>example, if software attempts to perform a
>> configuration read from a non-existent device or Function, the resulting UR
>Status in the Completion will signal the error
>> to software, and software does not need for the Completer in addition to
>signal the error by sending an ERR_NONFATAL
>> Message. In fact, on some platforms, signaling the error with
>ERR_NONFATAL results in a System Error, which breaks
>> normal software probing.
>>
>>>> NFE as ANFE will make us ignoring some UEs which need active recover
>>> /s/ignoring/ignore
>> Will fix.
>>
>>>> operation. To avoid clearing UEs that are not ANFE by accident, the
>>>> most conservative route is taken here: If any of the NFE Detected bits
>>>> is set in Device Status, do not touch UE status, they should be cleared
>>>> later by the UE handler. Otherwise, a specific set of UEs that may be
>>>> raised as ANFE according to the PCIe specification will be cleared if
>>>> their corresponding severity is Non-Fatal.
>>>>
>>>> For instance, previously when kernel receives an ANFE with Poisoned
>TLP
>>>> in OS native AER mode, only status of CE will be reported and cleared:
>>>>
>>>>   AER: Correctable error message received from 0000:b7:02.0
>>>>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver
>ID)
>>>>     device [8086:0db0] error status/mask=00002000/00000000
>>>>      [13] NonFatalErr
>>>>
>>>> If the kernel receives a Malformed TLP after that, two UEs will be
>>>> reported, which is unexpected. Malformed TLP Header is lost since
>>>> the previous ANFE gated the TLP header logs:
>>>>
>>>>   PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer,
>>> (Receiver ID)
>>>>     device [8086:0db0] error status/mask=00041000/00180020
>>>>      [12] TLP                    (First)
>>>>      [18] MalfTLP
>>>>
>>>> Now, for the same scenario, both CE status and related UE status will be
>>>> reported and cleared after ANFE:
>>>>
>>>>   AER: Correctable error message received from 0000:b7:02.0
>>>>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver
>ID)
>>>>     device [8086:0db0] error status/mask=00002000/00000000
>>>>      [13] NonFatalErr
>>>>     Uncorrectable errors that may cause Advisory Non-Fatal:
>>>>      [18] TLP
>>>>
>>>> Tested-by: Yudong Wang <yudong.wang@intel.com>
>>>> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
>>>> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>  drivers/pci/pcie/aer.c | 7 ++++++-
>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index ed435f09ac27..6a6a3a40569a 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct
>>> pci_dev *dev, struct aer_err_info *info)
>>>>  		 * Correctable error does not need software intervention.
>>>>  		 * No need to go through error recovery process.
>>>>  		 */
>>>> -		if (aer)
>>>> +		if (aer) {
>>>>  			pci_write_config_dword(dev, aer +
>>> PCI_ERR_COR_STATUS,
>>>>  					info->status);
>>>> +			if (info->anfe_status)
>>>> +				pci_write_config_dword(dev,
>>>> +						       aer +
>>> PCI_ERR_UNCOR_STATUS,
>>>> +						       info->anfe_status);
>>>> +		}
>>> Why split the handling part and storing part into two patches? Why not
>>> merge
>>> this part of patch 1/3.
>> This is based on Bjorn's suggestion at https://www.spinics.net/lists/linux-
>pci/msg149012.html,
>> clearing UNCOR_STATUS might be more important, deserve to raise out.
>
>I think Bjorn's suggestion is to divide it into two logical patches.
>One for printing the error and another to clear the UNCOR_STATUS
>properly. But currently you have split the UNCOR_STATUS status caching and
>clearing process into two patches. IMO, your first patch can store ANFE
>status and clear it. You can add print support in the second patch.

OK, I'll merge patch1 with patch3 in next version.

>
>
>Code wise it looks fine to me. You can add my Reviewed-by after fixing
>the typos
>
>Reviewed-by: Kuppuswamy Sathyanarayanan
><sathyanarayanan.kuppuswamy@linux.intel.com>

Thanks
zhenzhong

>
>>
>> Thanks
>> Zhenzhong
>
>--
>Sathyanarayanan Kuppuswamy
>Linux Kernel Developer

[v4,3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

Commit Message

Comments

Patch