Message ID | 20220210114218.632725-1-damien.lemoal@opensource.wdc.com |
---|---|
Headers | show |
Series | libsas and pm8001 fixes | expand |
On 10/02/2022 11:42, Damien Le Moal wrote: > In pm8001_pci_resume(), the use of the u32 type for the local variable > device_state causes a sparse warning: > > warning: incorrect type in assignment (different base types) > expected unsigned int [usertype] device_state > got restricted pci_power_t [usertype] current_state > > Since this variable is used only once in the function, remove it and > use pdev->current_state directly. While at it, also add a blank line > after the last local variable declaration. > > Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Regardless of a couple of comments: Reviewed-by: John Garry <john.garry@huawei.com> > --- > drivers/scsi/pm8001/pm8001_init.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c > index d8a2121cb8d9..4b9a26f008a9 100644 > --- a/drivers/scsi/pm8001/pm8001_init.c > +++ b/drivers/scsi/pm8001/pm8001_init.c > @@ -1335,13 +1335,13 @@ static int __maybe_unused pm8001_pci_resume(struct device *dev) > struct pm8001_hba_info *pm8001_ha; > int rc; > u8 i = 0, j; > - u32 device_state; > DECLARE_COMPLETION_ONSTACK(completion); > + > pm8001_ha = sha->lldd_ha; > - device_state = pdev->current_state; > > - pm8001_info(pm8001_ha, "pdev=0x%p, slot=%s, resuming from previous operating state [D%d]\n", > - pdev, pm8001_ha->name, device_state); > + pm8001_info(pm8001_ha, > + "pdev=0x%p, slot=%s, resuming from previous operating state [D%d]\n", I think that we may put this on the same line as pm8001_info Feel free to ignore this: if we're ok with changing logs, I am not sure on the "slot" value - it is already printed with pm8001_info. And printing pdev is suspect, since we should really be using dev_info or pci_info() and friends - but that is a bigger job. Thanks, John > + pdev, pm8001_ha->name, pdev->current_state); > > rc = pci_go_44(pdev); > if (rc)
On 2/10/22 23:28, John Garry wrote: > On 10/02/2022 11:42, Damien Le Moal wrote: >> In pm80xx_send_abort_all(), the n_elem field of the ccb used is not >> initialized to 0. This missing initialization sometimes lead to the >> task completion path seeing the ccb with a non-zero n_elem resulting in >> the execution of invalid dma_unmap_sg() calls in pm8001_ccb_task_free(), >> causing a crash such as: >> >> [ 197.676341] RIP: 0010:iommu_dma_unmap_sg+0x6d/0x280 >> [ 197.700204] RSP: 0018:ffff889bbcf89c88 EFLAGS: 00010012 >> [ 197.705485] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff83d0bda0 >> [ 197.712687] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff88810dffc0d0 >> [ 197.719887] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8881c790098b >> [ 197.727089] R10: ffffed1038f20131 R11: 0000000000000001 R12: 0000000000000000 >> [ 197.734296] R13: ffff88810dffc0d0 R14: 0000000000000010 R15: 0000000000000000 >> [ 197.741493] FS: 0000000000000000(0000) GS:ffff889bbcf80000(0000) knlGS:0000000000000000 >> [ 197.749659] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 197.755459] CR2: 00007f16c1b42734 CR3: 0000000004814000 CR4: 0000000000350ee0 >> [ 197.762656] Call Trace: >> [ 197.765127] <IRQ> >> [ 197.767162] pm8001_ccb_task_free+0x5f1/0x820 [pm80xx] >> [ 197.772364] ? do_raw_spin_unlock+0x54/0x220 >> [ 197.776680] pm8001_mpi_task_abort_resp+0x2ce/0x4f0 [pm80xx] >> [ 197.782406] process_oq+0xe85/0x7890 [pm80xx] >> [ 197.786817] ? lock_acquire+0x194/0x490 >> [ 197.790697] ? handle_irq_event+0x10e/0x1b0 >> [ 197.794920] ? mpi_sata_completion+0x2d70/0x2d70 [pm80xx] >> [ 197.800378] ? __wake_up_bit+0x100/0x100 >> [ 197.804340] ? lock_is_held_type+0x98/0x110 >> [ 197.808565] pm80xx_chip_isr+0x94/0x130 [pm80xx] >> [ 197.813243] tasklet_action_common.constprop.0+0x24b/0x2f0 >> [ 197.818785] __do_softirq+0x1b5/0x82d >> [ 197.822485] ? do_raw_spin_unlock+0x54/0x220 >> [ 197.826799] __irq_exit_rcu+0x17e/0x1e0 >> [ 197.830678] irq_exit_rcu+0xa/0x20 >> [ 197.834114] common_interrupt+0x78/0x90 >> [ 197.840051] </IRQ> >> [ 197.844236] <TASK> >> [ 197.848397] asm_common_interrupt+0x1e/0x40 >> > > That's nasty. > >> Avoid this issue by always initializing the ccb n_elem field to 0 in >> pm8001_send_abort_all(), pm8001_send_read_log() and >> pm80xx_send_abort_all(). >> >> Fixes: c6b9ef5779c3 ("[SCSI] pm80xx: NCQ error handling changes") >> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> >> --- >> drivers/scsi/pm8001/pm8001_hwi.c | 2 ++ >> drivers/scsi/pm8001/pm80xx_hwi.c | 1 + >> 2 files changed, 3 insertions(+) >> >> diff --git a/drivers/scsi/pm8001/pm8001_hwi.c b/drivers/scsi/pm8001/pm8001_hwi.c >> index 8095eb0b04f7..d853e8d0195a 100644 >> --- a/drivers/scsi/pm8001/pm8001_hwi.c >> +++ b/drivers/scsi/pm8001/pm8001_hwi.c >> @@ -1788,6 +1788,7 @@ static void pm8001_send_abort_all(struct pm8001_hba_info *pm8001_ha, >> ccb->device = pm8001_ha_dev; >> ccb->ccb_tag = ccb_tag; >> ccb->task = task; >> + ccb->n_elem = 0; > > Do you think that it would be better to clear this field when we free > the tag/ccb in pm8001_ccb_task_free()? I will note that there are error > paths whch only free the tag (not ccb), so need to be careful there. I am thinking that hunk like this one: ccb = &pm8001_ha->ccb_info[ccb_tag]; ccb->device = pm8001_ha_dev; ccb->ccb_tag = ccb_tag; ccb->task = task; To initialize a new ccb should be wrapped into a helper function, which would also add initialization for the other ccb fields. There are many places that have code like this, so that will also nicely cleanup the code. Then the free path can be left alone. Hmm ? > > BTW, I see that this never landed: > https://lore.kernel.org/lkml/20211214090337.29156-1-niejianglei2021@163.com/ Repost ! > > Though alloc'ing a domain_device in pm8001_send_read_log() is questionable. Yes, and the messages say "device not found" when the task completes so I think it is 100% useless. But I did not touch that since it did not seem to cause troubles. > > Thanks, > John > >> >> circularQ = &pm8001_ha->inbnd_q_tbl[0]; >> >> @@ -1849,6 +1850,7 @@ static void pm8001_send_read_log(struct pm8001_hba_info *pm8001_ha, >> ccb->device = pm8001_ha_dev; >> ccb->ccb_tag = ccb_tag; >> ccb->task = task; >> + ccb->n_elem = 0; >> pm8001_ha_dev->id |= NCQ_READ_LOG_FLAG; >> pm8001_ha_dev->id |= NCQ_2ND_RLE_FLAG; >> >> diff --git a/drivers/scsi/pm8001/pm80xx_hwi.c b/drivers/scsi/pm8001/pm80xx_hwi.c >> index 4d88c0dbcefc..902af4eefa26 100644 >> --- a/drivers/scsi/pm8001/pm80xx_hwi.c >> +++ b/drivers/scsi/pm8001/pm80xx_hwi.c >> @@ -1801,6 +1801,7 @@ static void pm80xx_send_abort_all(struct pm8001_hba_info *pm8001_ha, >> ccb->device = pm8001_ha_dev; >> ccb->ccb_tag = ccb_tag; >> ccb->task = task; >> + ccb->n_elem = 0; >> >> circularQ = &pm8001_ha->inbnd_q_tbl[0]; >> >
On 2/11/22 00:35, John Garry wrote: > On 10/02/2022 11:41, Damien Le Moal wrote: >> The first 3 patches fix a problem with libsas handling of NCQ NON DATA >> commands and simplify libsas code in a couple of places. >> The remaining patches are fixes for the pm8001 driver: >> * All sparse warnings are addressed, fixing along the way many le32 >> handling bugs for big-endian architectures >> * Fix handling of NCQ NON DATA commands >> * Fix NCQ error recovery (abort all task execution) that was causing a >> crash >> * Simplify the code in many places >> >> With these fixes, libzbc test suite passes all test case. This test >> suite was used with an SMR drive for testing because it generates many >> NCQ NON DATA commands (for zone management commands) and also generates >> many NCQ command errors to check ASC/ASCQ returned by the device. With >> the test suite, the error recovery path was extensively exercised. >> >> Note that without these patches, libzbc test suite result in the >> controller hanging, or in kernel crashes. > > Unfortunately I still see the hang on my arm64 system with this series :( That is unfortunate. Any particular command sequence triggering the hang ? Or is it random ? What workload are you running ? > > Thanks, > John
On Thu, Feb 10, 2022 at 08:42:06PM +0900, Damien Le Moal wrote: > +++ b/drivers/scsi/pm8001/pm80xx_hwi.c > @@ -71,14 +71,13 @@ static void pm80xx_pci_mem_copy(struct pm8001_hba_info *pm8001_ha, u32 soffset, > u32 dw_count, u32 bus_base_number) > { > u32 index, value, offset; > - u32 *destination1; > - destination1 = (u32 *)destination; > + __le32 *destination1 = (__le32 *)destination; I think the right fix here is to declare the destination argument as a le32 pointer without the incorrect const attribute.
On 2/11/22 15:14, Christoph Hellwig wrote: > On Thu, Feb 10, 2022 at 08:42:06PM +0900, Damien Le Moal wrote: >> +++ b/drivers/scsi/pm8001/pm80xx_hwi.c >> @@ -71,14 +71,13 @@ static void pm80xx_pci_mem_copy(struct pm8001_hba_info *pm8001_ha, u32 soffset, >> u32 dw_count, u32 bus_base_number) >> { >> u32 index, value, offset; >> - u32 *destination1; >> - destination1 = (u32 *)destination; >> + __le32 *destination1 = (__le32 *)destination; > > I think the right fix here is to declare the destination argument as a > le32 pointer without the incorrect const attribute. Yes. Much cleaner :)
On 10/02/2022 22:44, Damien Le Moal wrote: Hi Damien, >>> Note that without these patches, libzbc test suite result in the >>> controller hanging, or in kernel crashes. >> Unfortunately I still see the hang on my arm64 system with this series:( > That is unfortunate. Any particular command sequence triggering the hang > ? Or is it random ? What workload are you running ? > mount/unmount fails mostly even after as few as one attempt, but then even fdisk -l fails sometimes: root@(none)$ fdisk -l [ 97.924789] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 [ 97.930652] sas: sas_scsi_find_task: aborting task 0x(____ptrval____) [ 97.937149] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 97.943232] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 97.952020] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 97.958099] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 97.966881] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 97.972961] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 97.981737] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 97.991241] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.000752] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.010223] pm80xx0:: pm8001_abort_task 1345:rc= -5 [ 98.015180] sas: sas_scsi_find_task: querying task 0x(____ptrval____) [ 98.021645] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.027725] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.036495] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.042574] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.051344] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.057423] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.066154] pm80xx: rc= -5 [ 98.068854] sas: sas_scsi_find_task: aborting task 0x(____ptrval____) [ 98.075331] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.081411] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.090192] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.096271] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.105053] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.111132] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.119909] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.129414] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.138919] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.148388] pm80xx0:: pm8001_abort_task 1345:rc= -5 [ 98.153345] sas: sas_scsi_find_task: querying task 0x(____ptrval____) [ 98.159812] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.165892] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.174661] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.180741] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.189511] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.195590] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.204320] pm80xx: rc= -5 [ 98.207019] sas: sas_scsi_find_task: aborting task 0x(____ptrval____) [ 98.213497] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.219577] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.228359] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.234438] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.243220] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.249299] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.258075] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.267580] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.277085] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.286554] pm80xx0:: pm8001_abort_task 1345:rc= -5 [ 98.291510] sas: sas_scsi_find_task: querying task 0x(____ptrval____) [ 98.297978] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.304059] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.312827] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.318906] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.327677] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.333756] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.342487] pm80xx: rc= -5 [ 98.345186] sas: sas_scsi_find_task: aborting task 0x(____ptrval____) [ 98.351664] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.357743] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.366525] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.372604] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.381386] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.387465] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.396230] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.405734] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.415239] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.424709] pm80xx0:: pm8001_abort_task 1345:rc= -5 [ 98.429665] sas: sas_scsi_find_task: querying task 0x(____ptrval____) [ 98.436133] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.442213] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.450982] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.457061] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.465825] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.471904] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.480629] pm80xx: rc= -5 [ 98.483327] sas: sas_scsi_find_task: aborting task 0x(____ptrval____) [ 98.489800] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.495880] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.504661] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.510740] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.519523] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.525602] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.534372] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.543877] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.553382] pm80xx0:: pm8001_mpi_task_abort_resp 3682:task abort failed status 0x6 ,tag = 0x2, scp= 0x0 [ 98.562851] pm80xx0:: pm8001_abort_task 1345:rc= -5 [ 98.567807] sas: sas_scsi_find_task: querying task 0x(____ptrval____) [ 98.574275] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.580355] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.589124] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.595203] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.603968] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b [ 98.610047] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO Failure Drive:5000c500a7babc61 [ 98.618778] pm80xx: rc= -5 [ 98.621505] sas: --- Exit sas_scsi_recover_host: busy: 1 failed: 1 tries: 1 Sometimes I get TMF timeouts, which is a bad situation. I guess it's a subtle driver bug, but where ....? BTW, this following log needs removal/fixing at some stage by someone: [ 98.480629] pm80xx: rc= -5 It's from pm8001_query_task(). Thanks, John
On 2/11/22 18:24, John Garry wrote: > On 10/02/2022 22:44, Damien Le Moal wrote: > > Hi Damien, > >>>> Note that without these patches, libzbc test suite result in the >>>> controller hanging, or in kernel crashes. >>> Unfortunately I still see the hang on my arm64 system with this series:( >> That is unfortunate. Any particular command sequence triggering the hang >> ? Or is it random ? What workload are you running ? >> > > mount/unmount fails mostly even after as few as one attempt, but then > even fdisk -l fails sometimes: Try with patch 21 of my v2. It does fix a bug for scsi/sas case. That problem would likely lead to a crash though, but never know... > root@(none)$ fdisk -l > [ 97.924789] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 > [ 97.930652] sas: sas_scsi_find_task: aborting task 0x(____ptrval____) > [ 97.937149] pm80xx0:: mpi_ssp_completion 1937:sas IO status 0x3b > [ 97.943232] pm80xx0:: mpi_ssp_completion 1948:SAS Address of IO > Failure Drive:5000c500a7babc61 [...] > > Sometimes I get TMF timeouts, which is a bad situation. I guess it's a > subtle driver bug, but where ....? What is the command failing ? Always the same ? Can you try adding scsi trace to see the commands ? If you are "lucky", it is always the same type of command like for the NCQ NON DATA in my case. Though on mount, I would only expect a lot of read commands and not much else. There may be some writes and a flush too, so there will be "data" commands and "non data" commands. It may be an issue with non-data commands too ? > BTW, this following log needs removal/fixing at some stage by someone: > > [ 98.480629] pm80xx: rc= -5 > > It's from pm8001_query_task(). > > Thanks, > John
On 11/02/2022 12:37, Damien Le Moal wrote: Hi Damien, >> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a >> subtle driver bug, but where ....? > What is the command failing ? Always the same ? Can you try adding scsi > trace to see the commands ? This is the same issue I have had since day #1. Generally mount/unmount or even fdisk -l fails after booting into miniramfs. I wouldn't ever try to boot a distro. > > If you are "lucky", it is always the same type of command like for the > NCQ NON DATA in my case. I'm just trying SAS disks to start with - so it's an SCSI READ command. SATA/STP support is generally never as robust for SAS HBAs (HW and LLD bugs are common - as this series is evidence) so I start on something more basic - however SATA/STP also has this issue. The command is sent successfully but just never completes. Then sometimes the TMFs for error handling timeout and sometimes succeed. I don't have much to do on.... > Though on mount, I would only expect a lot of > read commands and not much else. Yes, and it is commonly the first SCSI read command which times out. It reliably breaks quite early. So I can think we can rule out issues like memory barriers/timing. There may be some writes and a flush > too, so there will be "data" commands and "non data" commands. It may be > an issue with non-data commands too ? > Not sure on that. I guess it isn't. Thanks, John
On 2/11/22 22:08, John Garry wrote: > On 11/02/2022 12:37, Damien Le Moal wrote: > > Hi Damien, > >>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a >>> subtle driver bug, but where ....? >> What is the command failing ? Always the same ? Can you try adding scsi >> trace to see the commands ? > > This is the same issue I have had since day #1. > > Generally mount/unmount or even fdisk -l fails after booting into > miniramfs. I wouldn't ever try to boot a distro. busybox ? > >> >> If you are "lucky", it is always the same type of command like for the >> NCQ NON DATA in my case. > > I'm just trying SAS disks to start with - so it's an SCSI READ command. > SATA/STP support is generally never as robust for SAS HBAs (HW and LLD > bugs are common - as this series is evidence) so I start on something > more basic - however SATA/STP also has this issue. > > The command is sent successfully but just never completes. Then > sometimes the TMFs for error handling timeout and sometimes succeed. I > don't have much to do on.... No SAS bus analyzer lying in a corner of the office ? :) That could help... I will go to the office Monday. So I will get a chance to add SAS drives to my setup to see what I get. I have only tested with SATA until now. My controller is not the same chip as yours though. > >> Though on mount, I would only expect a lot of >> read commands and not much else. > > Yes, and it is commonly the first SCSI read command which times out. It > reliably breaks quite early. So I can think we can rule out issues like > memory barriers/timing. > > There may be some writes and a flush >> too, so there will be "data" commands and "non data" commands. It may be >> an issue with non-data commands too ? >> > > Not sure on that. I guess it isn't. Anything special with the drives you are using ? Have you tried other drives to see if you get lucky ? > > Thanks, > John
Hi Damien, >> >>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a >>>> subtle driver bug, but where ....? >>> What is the command failing ? Always the same ? Can you try adding scsi >>> trace to see the commands ? >> >> This is the same issue I have had since day #1. >> >> Generally mount/unmount or even fdisk -l fails after booting into >> miniramfs. I wouldn't ever try to boot a distro. > > busybox ? > Yes >> >>> >>> If you are "lucky", it is always the same type of command like for the >>> NCQ NON DATA in my case. >> >> I'm just trying SAS disks to start with - so it's an SCSI READ command. >> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD >> bugs are common - as this series is evidence) so I start on something >> more basic - however SATA/STP also has this issue. >> >> The command is sent successfully but just never completes. Then >> sometimes the TMFs for error handling timeout and sometimes succeed. I >> don't have much to do on.... > > No SAS bus analyzer lying in a corner of the office ? :) > That could help... None unfortunately > > I will go to the office Monday. So I will get a chance to add SAS drives > to my setup to see what I get. I have only tested with SATA until now. > My controller is not the same chip as yours though. jfyi, Ajish, cc'ed, from microchip says that they see the same issue on their arm64 system. Unfortunately fixing it is not a priority for them. So it is something which arm64 is exposing. And I tried an old kernel - like 4.10 - on the same board and the pm8001 driver was working somewhat reliably (no hangs). It just looks like a driver issue. I'll have a look at the driver code again if I get a chance. It might be a DMA issue. > >> >>> Though on mount, I would only expect a lot of >>> read commands and not much else. >> >> Yes, and it is commonly the first SCSI read command which times out. It >> reliably breaks quite early. So I can think we can rule out issues like >> memory barriers/timing. >> >> There may be some writes and a flush >>> too, so there will be "data" commands and "non data" commands. It may be >>> an issue with non-data commands too ? >>> >> >> Not sure on that. I guess it isn't. > > Anything special with the drives you are using ? Have you tried other > drives to see if you get lucky ? > I think that the drives are ok. On the same board I originally had them plugged in the hisi_sas HBA and no issues. thanks, john
On 2/11/22 22:54, John Garry wrote: > Hi Damien, > >>> >>>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a >>>>> subtle driver bug, but where ....? >>>> What is the command failing ? Always the same ? Can you try adding scsi >>>> trace to see the commands ? >>> >>> This is the same issue I have had since day #1. >>> >>> Generally mount/unmount or even fdisk -l fails after booting into >>> miniramfs. I wouldn't ever try to boot a distro. >> >> busybox ? >> > > Yes > >>> >>>> >>>> If you are "lucky", it is always the same type of command like for the >>>> NCQ NON DATA in my case. >>> >>> I'm just trying SAS disks to start with - so it's an SCSI READ command. >>> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD >>> bugs are common - as this series is evidence) so I start on something >>> more basic - however SATA/STP also has this issue. >>> >>> The command is sent successfully but just never completes. Then >>> sometimes the TMFs for error handling timeout and sometimes succeed. I >>> don't have much to do on.... >> >> No SAS bus analyzer lying in a corner of the office ? :) >> That could help... > > None unfortunately > >> >> I will go to the office Monday. So I will get a chance to add SAS drives >> to my setup to see what I get. I have only tested with SATA until now. >> My controller is not the same chip as yours though. > > jfyi, Ajish, cc'ed, from microchip says that they see the same issue on > their arm64 system. Unfortunately fixing it is not a priority for them. > So it is something which arm64 is exposing. > > And I tried an old kernel - like 4.10 - on the same board and the pm8001 > driver was working somewhat reliably (no hangs). It just looks like a > driver issue. > > I'll have a look at the driver code again if I get a chance. It might be > a DMA issue. There is one more thing that I find strange in the driver and that may cause problems: tag 0 is a perfectly valid tag value that can be returned by pm8001_tag_alloc() since find_first_zero_bit() will return 0 if the first bit is 0. And yet, there are many places in the driver that treat !tag as an error. Extremely weird, if not outright broken... I patched that and tested and everything seems OK... Could it be that you are not seeing some completions because of that ? I added the patch to my v3. Will send Monday.