Message ID | 20210407080118.1916040-7-mkl@pengutronix.de |
---|---|
State | New |
Headers | show |
Series | [net-next,1/6] can: skb: alloc_can{,fd}_skb(): set "cf" to NULL if skb allocation fails | expand |
On Wed, Apr 7, 2021 at 1:01 AM Marc Kleine-Budde <mkl@pengutronix.de> wrote: > > MCP251XFD_REG_TBC is the time base counter register. It increments > once per SYS clock tick, which is 20 or 40 MHz. Observation shows that > if the lowest byte (which is transferred first on the SPI bus) of that > register is 0x00 or 0x80 the calculated CRC doesn't always match the > transferred one. > > To reproduce this problem let the driver read the TBC register in a > high frequency. This can be done by attaching only the mcp251xfd CAN > controller to a valid terminated CAN bus and send a single CAN frame. > As there are no other CAN controller on the bus, the sent CAN frame is > not ACKed and the mcp251xfd repeats it. If user space enables the bus > error reporting, each of the NACK errors is reported with a time > stamp (which is read from the TBC register) to user space. > > $ ip link set can0 down > $ ip link set can0 up type can bitrate 500000 berr-reporting on > $ cansend can0 4FF#ff.01.00.00.00.00.00.00 > > This leads to several error messages per second: > > | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 3a 86 da, CRC=0x7753) retrying. > | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 01 b4 da, CRC=0x5830) retrying. > | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 e9 23 db, CRC=0xa723) retrying. > | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 8a 30 db, CRC=0x4a9c) retrying. > | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 f3 43 db, CRC=0x66d2) retrying. > > If the highest bit in the lowest byte is flipped the transferred CRC > matches the calculated one. We assume for now the CRC calculation in > the chip works on wrong data and the transferred data is correct. > > This patch implements the following workaround: > > - If a CRC read error on the TBC register is detected and the lowest > byte is 0x00 or 0x80, the highest bit of the lowest byte is flipped > and the CRC is calculated again. > - If the CRC now matches, the _original_ data is passed to the reader. > For now we assume transferred data was OK. > > Link: https://lore.kernel.org/r/20210406110617.1865592-5-mkl@pengutronix.de > Cc: Manivannan Sadhasivam <mani@kernel.org> > Cc: Thomas Kopp <thomas.kopp@microchip.com> > Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> > --- > .../net/can/spi/mcp251xfd/mcp251xfd-regmap.c | 34 +++++++++++++++++++ > 1 file changed, 34 insertions(+) > > diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c > index 35557ac43c03..297491516a26 100644 > --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c > +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c > @@ -321,6 +321,40 @@ mcp251xfd_regmap_crc_read(void *context, > if (err != -EBADMSG) > return err; > > + /* MCP251XFD_REG_TBC is the time base counter > + * register. It increments once per SYS clock tick, > + * which is 20 or 40 MHz. > + * > + * Observation shows that if the lowest byte (which is > + * transferred first on the SPI bus) of that register > + * is 0x00 or 0x80 the calculated CRC doesn't always > + * match the transferred one. > + * > + * If the highest bit in the lowest byte is flipped > + * the transferred CRC matches the calculated one. We > + * assume for now the CRC calculation in the chip > + * works on wrong data and the transferred data is > + * correct. > + */ > + if (reg == MCP251XFD_REG_TBC && > + (buf_rx->data[0] == 0x0 || buf_rx->data[0] == 0x80)) { > + /* Flip highest bit in lowest byte of le32 */ > + buf_rx->data[0] ^= 0x80; > + > + /* re-check CRC */ > + err = mcp251xfd_regmap_crc_read_check_crc(buf_rx, > + buf_tx, > + val_len); > + if (!err) { > + /* If CRC is now correct, assume > + * transferred data was OK, flip bit > + * back to original value. > + */ > + buf_rx->data[0] ^= 0x80; > + goto out; > + } > + } > + > /* MCP251XFD_REG_OSC is the first ever reg we read from. > * > * The chip may be in deep sleep and this SPI transfer > -- > 2.30.2 > > Hello Marc, I am encountering similar error with the 5.10 raspberrypi kernel on RPi 4 with MCP2518FD: mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 ad 58 67, CRC=0xbbfd) retrying Would it be possible for you to pull these patches into a v5.10 branch in your linux-rpi repo [1]? Thanks, Drew [1] https://github.com/marckleinebudde/linux-rpi
On 4/21/21 9:58 PM, Drew Fustini wrote: > I am encountering similar error with the 5.10 raspberrypi kernel on > RPi 4 with MCP2518FD: > > mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, > data=00 ad 58 67, CRC=0xbbfd) retrying What's the situation you see these errors? I'm not particular happy with that patch, as it only works around that one particular bit flip issue. If you really hammer the register, the driver will still notice CRC errors that can be explained by other bits flipping. Consider this as the first order approximation of a higher order problem :) - the root cause is still unknown. > Would it be possible for you to pull these patches into a v5.10 branch > in your linux-rpi repo [1]? Here you are: https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements I've included the UINC performance enhancements, too. The branch is compiled tested only, though. I'll send a pull request to the rpi kernel after I've testing feedback from you. regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Am 22.04.21 um 09:18 schrieb Marc Kleine-Budde: > On 4/21/21 9:58 PM, Drew Fustini wrote: >> I am encountering similar error with the 5.10 raspberrypi kernel on >> RPi 4 with MCP2518FD: >> >> mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, >> data=00 ad 58 67, CRC=0xbbfd) retrying > > What's the situation you see these errors? > > I'm not particular happy with that patch, as it only works around that one > particular bit flip issue. If you really hammer the register, the driver will > still notice CRC errors that can be explained by other bits flipping. Consider > this as the first order approximation of a higher order problem :) - the root > cause is still unknown. > >> Would it be possible for you to pull these patches into a v5.10 branch >> in your linux-rpi repo [1]? > > Here you are: > > https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements > > I've included the UINC performance enhancements, too. The branch is compiled > tested only, though. I'll send a pull request to the rpi kernel after I've > testing feedback from you. > > regards, > Marc > I can also confirm these occasional CRC errors. I run a custom pytest suite on a pi3b+ that uses isotp and bcm in parallel. No CAN-FD, just 500k regular vehicle can for infotainment. $ dmesg | grep CRC [ 8.198863] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized. [ 8.209159] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized. [74264.462934] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010 (length=4, data=80 33 d2 8a, CRC=0x0e3e) retrying. [74749.267977] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010 (length=4, data=80 1a ad 0e, CRC=0x8d17) retrying. [591150.766153] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010 (length=4, data=80 3d b5 60, CRC=0x5e9c) retrying. Best Regards, Patrick
On 22.04.2021 09:18:54, Marc Kleine-Budde wrote: > On 4/21/21 9:58 PM, Drew Fustini wrote: > > I am encountering similar error with the 5.10 raspberrypi kernel on > > RPi 4 with MCP2518FD: > > > > mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, > > data=00 ad 58 67, CRC=0xbbfd) retrying > > What's the situation you see these errors? > > I'm not particular happy with that patch, as it only works around that one > particular bit flip issue. If you really hammer the register, the driver will > still notice CRC errors that can be explained by other bits flipping. Consider > this as the first order approximation of a higher order problem :) - the root > cause is still unknown. > > > Would it be possible for you to pull these patches into a v5.10 branch > > in your linux-rpi repo [1]? > > Here you are: > > https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements > > I've included the UINC performance enhancements, too. The branch is compiled > tested only, though. I'll send a pull request to the rpi kernel after I've > testing feedback from you. Drew, Patrick, have you tested this branch? If so I'll send a pull request to the raspi kernel. regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Am 07.05.21 um 09:25 schrieb Marc Kleine-Budde: > On 22.04.2021 09:18:54, Marc Kleine-Budde wrote: >> On 4/21/21 9:58 PM, Drew Fustini wrote: >>> I am encountering similar error with the 5.10 raspberrypi kernel on >>> RPi 4 with MCP2518FD: >>> >>> mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, >>> data=00 ad 58 67, CRC=0xbbfd) retrying >> >> What's the situation you see these errors? >> >> I'm not particular happy with that patch, as it only works around that one >> particular bit flip issue. If you really hammer the register, the driver will >> still notice CRC errors that can be explained by other bits flipping. Consider >> this as the first order approximation of a higher order problem :) - the root >> cause is still unknown. >> >>> Would it be possible for you to pull these patches into a v5.10 branch >>> in your linux-rpi repo [1]? >> >> Here you are: >> >> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements >> >> I've included the UINC performance enhancements, too. The branch is compiled >> tested only, though. I'll send a pull request to the rpi kernel after I've >> testing feedback from you. > > Drew, Patrick, have you tested this branch? If so I'll send a pull > request to the raspi kernel. > Sorry Marc, not yet. Thanks for reminding me. I'll start a native build on a pi0w asap. Is there any test application or stress test that I should run? Regards, Patrick
On 07.05.2021 08:21:57, Patrick Menschel wrote: > >>> Would it be possible for you to pull these patches into a v5.10 branch > >>> in your linux-rpi repo [1]? > >> > >> Here you are: > >> > >> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements > >> > >> I've included the UINC performance enhancements, too. The branch is compiled > >> tested only, though. I'll send a pull request to the rpi kernel after I've > >> testing feedback from you. > > > > Drew, Patrick, have you tested this branch? If so I'll send a pull > > request to the raspi kernel. > > > not yet. Thanks for reminding me. I'll start a native build on a pi0w asap. > > Is there any test application or stress test that I should run? No, not any particular, do your normal (stress) testing. regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
On Fri, May 7, 2021 at 12:56 AM Marc Kleine-Budde <mkl@pengutronix.de> wrote: > > On 22.04.2021 09:18:54, Marc Kleine-Budde wrote: > > On 4/21/21 9:58 PM, Drew Fustini wrote: > > > I am encountering similar error with the 5.10 raspberrypi kernel on > > > RPi 4 with MCP2518FD: > > > > > > mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, > > > data=00 ad 58 67, CRC=0xbbfd) retrying > > > > What's the situation you see these errors? > > > > I'm not particular happy with that patch, as it only works around that one > > particular bit flip issue. If you really hammer the register, the driver will > > still notice CRC errors that can be explained by other bits flipping. Consider > > this as the first order approximation of a higher order problem :) - the root > > cause is still unknown. > > > > > Would it be possible for you to pull these patches into a v5.10 branch > > > in your linux-rpi repo [1]? > > > > Here you are: > > > > https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements > > > > I've included the UINC performance enhancements, too. The branch is compiled > > tested only, though. I'll send a pull request to the rpi kernel after I've > > testing feedback from you. > > Drew, Patrick, have you tested this branch? If so I'll send a pull > request to the raspi kernel. Thank you for following up. I need to build it and send it to the friend who was testing to check if the CRC errors go away. He is testing CANFD with a 2021 Ford F150 truck. I will follow up here once I know the results. -Drew
On 07.05.2021 15:36:32, Drew Fustini wrote: > > > > Would it be possible for you to pull these patches into a v5.10 branch > > > > in your linux-rpi repo [1]? > > > > > > Here you are: > > > > > > https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements > > > > > > I've included the UINC performance enhancements, too. The branch is compiled > > > tested only, though. I'll send a pull request to the rpi kernel after I've > > > testing feedback from you. > > > > Drew, Patrick, have you tested this branch? If so I'll send a pull > > request to the raspi kernel. > > Thank you for following up. > > I need to build it and send it to the friend who was testing to check > if the CRC errors go away. He is testing CANFD with a 2021 Ford F150 > truck. I will follow up here once I know the results. The CRC Errors don't go away completely, however they are reduced by more than an order of magnitude. regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Am 07.05.21 um 10:25 schrieb Marc Kleine-Budde: > On 07.05.2021 08:21:57, Patrick Menschel wrote: >>>>> Would it be possible for you to pull these patches into a v5.10 branch >>>>> in your linux-rpi repo [1]? >>>> >>>> Here you are: >>>> >>>> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements >>>> >>>> I've included the UINC performance enhancements, too. The branch is compiled >>>> tested only, though. I'll send a pull request to the rpi kernel after I've >>>> testing feedback from you. >>> >>> Drew, Patrick, have you tested this branch? If so I'll send a pull >>> request to the raspi kernel. >>> > >> not yet. Thanks for reminding me. I'll start a native build on a pi0w asap. >> >> Is there any test application or stress test that I should run? > > No, not any particular, do your normal (stress) testing. > Following up on this. Build and test finished on a pi0w. ### Test conditions ### Since I lacked a true stress test, I wrote one for regular tox with pytest collection. https://gitlab.com/Menschel/socketcan/-/blob/master/tests/test_socketcan.py#L872 It uses mcp0 and mcp1 which are directly connected. No CAN FD, just 500k with regular frames, random id and random data. I basically mimic cangen but enhanced with a queue that handles to the rx thread what should be compared next. ### Extract from dmesg shows no CRC Errors ### [ 30.930608] CAN device driver interface [ 30.967349] spi_master spi0: will run message pump with realtime priority [ 31.054202] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized. [ 31.076906] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized. [ 31.298969] mcp251xfd spi0.0 mcp0: renamed from can1 [ 31.339864] mcp251xfd spi0.1 mcp1: renamed from can0 [ 33.471889] IPv6: ADDRCONF(NETDEV_CHANGE): mcp0: link becomes ready [ 34.482260] IPv6: ADDRCONF(NETDEV_CHANGE): mcp1: link becomes ready [ 215.218979] can: controller area network core [ 215.219146] NET: Registered protocol family 29 [ 215.261599] can: raw protocol [ 218.745376] can: isotp protocol [ 220.931150] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 220.931274] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 220.931395] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 220.931518] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 220.931643] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 220.931768] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 220.931893] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 222.099822] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 222.099901] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 222.100022] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!! [ 222.330438] can: broadcast manager protocol That softirq error has something to do with IsoTp. I was not able to trace it back but I have it on multiple boards: pi0w, pi3b, pi3b+. ### Performance ### ## v5.10-rpi/backport-performance-improvements ## I get about 20000 frames in 2 minutes. 2021-05-08 19:00:36 [ INFO] 20336 frames in 0:02:00 (test_socketcan.py:890) 2021-05-08 19:49:34 [ INFO] 20001 frames in 0:02:00 (test_socketcan.py:890) ## regular v5.10 ## 2021-05-08 20:19:55 [ INFO] 20000 frames in 0:02:00 (test_socketcan.py:890) 2021-05-08 20:22:40 [ INFO] 19995 frames in 0:02:00 (test_socketcan.py:890) 2021-05-08 20:25:22 [ INFO] 19931 frames in 0:02:00 (test_socketcan.py:890) The numbers are slightly better but I count that as tolerance. I also found that there are cross effects. If I run the same test on vcan0 before, the frame count goes down to 13000 instead. I also have to admit, that I didn't get any crc errors with regular v5.10 during that tests. Do I have to change my test? I can still update that pi3b+ that runs my micro-hil at work. That was the one that occasionally had CRC errors. Best Regards, Patrick
Am 08.05.21 um 20:36 schrieb Patrick Menschel: > Am 07.05.21 um 10:25 schrieb Marc Kleine-Budde: >> On 07.05.2021 08:21:57, Patrick Menschel wrote: >>>>>> Would it be possible for you to pull these patches into a v5.10 branch >>>>>> in your linux-rpi repo [1]? >>>>> >>>>> Here you are: >>>>> >>>>> https://github.com/marckleinebudde/linux-rpi/tree/v5.10-rpi/backport-performance-improvements >>>>> >>>>> I've included the UINC performance enhancements, too. The branch is compiled >>>>> tested only, though. I'll send a pull request to the rpi kernel after I've >>>>> testing feedback from you. >>>> >>>> Drew, Patrick, have you tested this branch? If so I'll send a pull >>>> request to the raspi kernel. >>>> >> >>> not yet. Thanks for reminding me. I'll start a native build on a pi0w asap. >>> >>> Is there any test application or stress test that I should run? >> >> No, not any particular, do your normal (stress) testing. >> > Following up on this. > > ... > > Do I have to change my test? Hi Marc, I changed my test to 1 hour and removed the sleep statement. Still no measurable difference for performance and no CRC Errors with both kernels. Apparently the test is hard on the CPU, I have two pytest processes listed in htop one with 80%CPU and one with 60% CPU, approx 30% ram usage of 512MB. I have no clue how it reaches the CPU values, there should be only one CPU on the pi0w. ### 5.10.17+ on pi0w ### 2021-05-09 08:02:56 [ INFO] 725649 frames in 1:00:00 (test_socketcan.py:890) ### 5.10.31-performance-backports+ on pi0w ### 2021-05-09 09:13:32 [ INFO] 715936 frames in 1:00:00 (test_socketcan.py:890) I'll switch boards to a pi3b and test again with these settings. Best Regards, Patrick
On 08.05.2021 18:36:56, Patrick Menschel wrote: > ### Test conditions ### > > Since I lacked a true stress test, I wrote one for regular tox with > pytest collection. > > https://gitlab.com/Menschel/socketcan/-/blob/master/tests/test_socketcan.py#L872 > > It uses mcp0 and mcp1 which are directly connected. > No CAN FD, just 500k with regular frames, random id and random data. > > I basically mimic cangen but enhanced with a queue that handles to the > rx thread what should be compared next. > > ### Extract from dmesg shows no CRC Errors ### > > [ 30.930608] CAN device driver interface > [ 30.967349] spi_master spi0: will run message pump with realtime priority > [ 31.054202] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT > -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz > r:17.00MHz e:16.66MHz) successfully initialized. > [ 31.076906] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT > -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz > r:17.00MHz e:16.66MHz) successfully initialized. > [ 31.298969] mcp251xfd spi0.0 mcp0: renamed from can1 > [ 31.339864] mcp251xfd spi0.1 mcp1: renamed from can0 > [ 33.471889] IPv6: ADDRCONF(NETDEV_CHANGE): mcp0: link becomes ready > [ 34.482260] IPv6: ADDRCONF(NETDEV_CHANGE): mcp1: link becomes ready > [ 215.218979] can: controller area network core > [ 215.219146] NET: Registered protocol family 29 > [ 215.261599] can: raw protocol > [ 218.745376] can: isotp protocol > [ 220.931150] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 220.931274] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 220.931395] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 220.931518] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 220.931643] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 220.931768] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 220.931893] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 222.099822] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 222.099901] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 222.100022] NOHZ tick-stop error: Non-RCU local softirq work is > pending, handler #08!!! > [ 222.330438] can: broadcast manager protocol > > That softirq error has something to do with IsoTp. I was not able to > trace it back but I have it on multiple boards: pi0w, pi3b, pi3b+. The softirq error is known and shows up as the mcp251xfd driver raises a softirq from threaded IRQ context. We're working on fixing this. > ### Performance ### > > ## v5.10-rpi/backport-performance-improvements ## > > I get about 20000 frames in 2 minutes. > > 2021-05-08 19:00:36 [ INFO] 20336 frames in 0:02:00 > (test_socketcan.py:890) > > 2021-05-08 19:49:34 [ INFO] 20001 frames in 0:02:00 > (test_socketcan.py:890) > > > ## regular v5.10 ## > > 2021-05-08 20:19:55 [ INFO] 20000 frames in 0:02:00 > (test_socketcan.py:890) > > 2021-05-08 20:22:40 [ INFO] 19995 frames in 0:02:00 > (test_socketcan.py:890) > > 2021-05-08 20:25:22 [ INFO] 19931 frames in 0:02:00 > (test_socketcan.py:890) > > > The numbers are slightly better but I count that as tolerance. Makes sense. But you have only measured number of frames in a given time. The raspi SPI driver is highly optimized so the changes in the driver don't show up in those numbers. Thanks for testing, I'll send a pull request to the raspi kernel. If you are interested if there are performance benefits on your raspi, consider measuring the spent CPU time and the number of SPI interrupts. Measure CPU time by putting the command "time" in front of your test. Measure SPI Interrupts by looking at /proc/interrupts before and after the test. Note: there are SPI host controller interrupts and Interrupts from the mcp251xfd. On a raspi you probably only have a hand full of SPI host controller interrupts, as the raspi driver only uses interrupts for long transfers. There will be a mcp251xfd interrupt per TX-complete and RX CAN message, maybe a few less if they overlap. > I also found that there are cross effects. If I run the same test on > vcan0 before, the frame count goes down to 13000 instead. The changes only touch the mcp251xfd driver, if you see a difference with the vcan driver, it's either a change in the kernel somewhere else or your test setup is sensitive to something you changed without noticing (starting condition, ...) > I also have to admit, that I didn't get any crc errors with regular > v5.10 during that tests. The CRC errors the patch works around are CRC errors introduced by a chip erratum, not by electromagnetic interference. In my observation these CRC errors show up if the register contents changes while the register is read. The register that changes most is the timer base counter register. That register is only read if a CAN bus error is signaled to user space (and this is maximized by enabling bus error reporting). If it happens to be a CRC error while reading the TBC register and the CRC can be "corrected" by flipping the upper most bit, there will be no error message about any CRC errors. Long story short. You only notice that this patch works, if in a situation you had CRC errors on the TBC register (that is CAN errors are reported to user space), you now have an order of magnitude less CRC errors than before. > Do I have to change my test? No need to. > I can still update that pi3b+ that runs my micro-hil at work. That was > the one that occasionally had CRC errors. Thanks again for testing! regards, Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
On 09.05.2021 07:46:20, Patrick Menschel wrote: > > Do I have to change my test? > > I changed my test to 1 hour and removed the sleep statement. > Still no measurable difference for performance and no CRC Errors with > both kernels. See other mail about my thoughts about performance and CRC. > Apparently the test is hard on the CPU, I have two pytest processes > listed in htop one with 80%CPU and one with 60% CPU, approx 30% ram > usage of 512MB. I have no clue how it reaches the CPU values, there > should be only one CPU on the pi0w. Interesting :) > ### 5.10.17+ on pi0w ### > > 2021-05-09 08:02:56 [ INFO] 725649 frames in 1:00:00 > (test_socketcan.py:890) > > ### 5.10.31-performance-backports+ on pi0w ### > > 2021-05-09 09:13:32 [ INFO] 715936 frames in 1:00:00 > (test_socketcan.py:890) > > I'll switch boards to a pi3b and test again with these settings. Marc -- Pengutronix e.K. | Marc Kleine-Budde | Embedded Linux | https://www.pengutronix.de | Vertretung West/Dortmund | Phone: +49-231-2826-924 | Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
Am 10.05.21 um 09:45 schrieb Marc Kleine-Budde: Just checked that particular pi3b+. 2 errors in 10 days. $ dmesg | grep -i crc [ 8.335215] mcp251xfd spi0.1 can0: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized. [ 8.362778] mcp251xfd spi0.0 can1: MCP2518FD rev0.0 (-RX_INT -MAB_NO_WARN +CRC_REG +CRC_RX +CRC_TX +ECC -HD c:40.00MHz m:20.00MHz r:17.00MHz e:16.66MHz) successfully initialized. [178691.606232] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010 (length=4, data=84 35 73 16, CRC=0x9b26) retrying. [188341.664546] mcp251xfd spi0.0 mcp0: CRC read error at address 0x0010 (length=4, data=14 df c5 f5, CRC=0xe06b) retrying. Regards, Patrick
diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c index 35557ac43c03..297491516a26 100644 --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-regmap.c @@ -321,6 +321,40 @@ mcp251xfd_regmap_crc_read(void *context, if (err != -EBADMSG) return err; + /* MCP251XFD_REG_TBC is the time base counter + * register. It increments once per SYS clock tick, + * which is 20 or 40 MHz. + * + * Observation shows that if the lowest byte (which is + * transferred first on the SPI bus) of that register + * is 0x00 or 0x80 the calculated CRC doesn't always + * match the transferred one. + * + * If the highest bit in the lowest byte is flipped + * the transferred CRC matches the calculated one. We + * assume for now the CRC calculation in the chip + * works on wrong data and the transferred data is + * correct. + */ + if (reg == MCP251XFD_REG_TBC && + (buf_rx->data[0] == 0x0 || buf_rx->data[0] == 0x80)) { + /* Flip highest bit in lowest byte of le32 */ + buf_rx->data[0] ^= 0x80; + + /* re-check CRC */ + err = mcp251xfd_regmap_crc_read_check_crc(buf_rx, + buf_tx, + val_len); + if (!err) { + /* If CRC is now correct, assume + * transferred data was OK, flip bit + * back to original value. + */ + buf_rx->data[0] ^= 0x80; + goto out; + } + } + /* MCP251XFD_REG_OSC is the first ever reg we read from. * * The chip may be in deep sleep and this SPI transfer
MCP251XFD_REG_TBC is the time base counter register. It increments once per SYS clock tick, which is 20 or 40 MHz. Observation shows that if the lowest byte (which is transferred first on the SPI bus) of that register is 0x00 or 0x80 the calculated CRC doesn't always match the transferred one. To reproduce this problem let the driver read the TBC register in a high frequency. This can be done by attaching only the mcp251xfd CAN controller to a valid terminated CAN bus and send a single CAN frame. As there are no other CAN controller on the bus, the sent CAN frame is not ACKed and the mcp251xfd repeats it. If user space enables the bus error reporting, each of the NACK errors is reported with a time stamp (which is read from the TBC register) to user space. $ ip link set can0 down $ ip link set can0 up type can bitrate 500000 berr-reporting on $ cansend can0 4FF#ff.01.00.00.00.00.00.00 This leads to several error messages per second: | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 3a 86 da, CRC=0x7753) retrying. | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 01 b4 da, CRC=0x5830) retrying. | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 e9 23 db, CRC=0xa723) retrying. | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=00 8a 30 db, CRC=0x4a9c) retrying. | mcp251xfd spi0.0 can0: CRC read error at address 0x0010 (length=4, data=80 f3 43 db, CRC=0x66d2) retrying. If the highest bit in the lowest byte is flipped the transferred CRC matches the calculated one. We assume for now the CRC calculation in the chip works on wrong data and the transferred data is correct. This patch implements the following workaround: - If a CRC read error on the TBC register is detected and the lowest byte is 0x00 or 0x80, the highest bit of the lowest byte is flipped and the CRC is calculated again. - If the CRC now matches, the _original_ data is passed to the reader. For now we assume transferred data was OK. Link: https://lore.kernel.org/r/20210406110617.1865592-5-mkl@pengutronix.de Cc: Manivannan Sadhasivam <mani@kernel.org> Cc: Thomas Kopp <thomas.kopp@microchip.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> --- .../net/can/spi/mcp251xfd/mcp251xfd-regmap.c | 34 +++++++++++++++++++ 1 file changed, 34 insertions(+)