Message ID | 20230811140749.5202-2-qianweili@huawei.com |
---|---|
State | New |
Headers | show |
Series | crypto: hisilicon - fix some issues in hisilicon drivers | expand |
On Fri, Aug 11, 2023 at 10:07:43PM +0800, Weili Qian wrote: > The malibox needs to be triggered by a 128bit atomic operation. The > reason is that one QM hardware entity in one accelerator servers QM > mailbox MMIO interfaces in related PF and VFs. A mutex cannot lock > mailbox processes in different functions. When multiple functions access > the mailbox simultaneously, if the generic IO interface readq/writeq > is used to access the mailbox, the data read from mailbox or written to > mailbox is unpredictable. Therefore, the generic IO interface is changed > to a 128bit atomic operation. > > Signed-off-by: Weili Qian <qianweili@huawei.com> > --- > drivers/crypto/hisilicon/qm.c | 160 ++++++++++++++++++++++------------ > include/linux/hisi_acc_qm.h | 1 - > 2 files changed, 105 insertions(+), 56 deletions(-) > mode change 100644 => 100755 drivers/crypto/hisilicon/qm.c ... > - qm_mb_write(qm, mailbox); > +#if IS_ENABLED(CONFIG_ARM64) > + asm volatile("ldp %0, %1, %3\n" > + "stp %0, %1, %2\n" > + "dmb oshst\n" > + : "=&r" (tmp0), > + "=&r" (tmp1), > + "+Q" (*((char *)dst)) > + : "Q" (*((char __iomem *)fun_base)) > + : "memory"); > +#endif You should add a generic 128-bite write primitive for arm64 instead of doing it in raw assembly in the driver. Thanks,
On Fri, 18 Aug 2023 at 10:13, Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Fri, Aug 11, 2023 at 10:07:43PM +0800, Weili Qian wrote: > > The malibox needs to be triggered by a 128bit atomic operation. The > > reason is that one QM hardware entity in one accelerator servers QM > > mailbox MMIO interfaces in related PF and VFs. A mutex cannot lock > > mailbox processes in different functions. When multiple functions access > > the mailbox simultaneously, if the generic IO interface readq/writeq > > is used to access the mailbox, the data read from mailbox or written to > > mailbox is unpredictable. Therefore, the generic IO interface is changed > > to a 128bit atomic operation. > > > > Signed-off-by: Weili Qian <qianweili@huawei.com> > > --- > > drivers/crypto/hisilicon/qm.c | 160 ++++++++++++++++++++++------------ > > include/linux/hisi_acc_qm.h | 1 - > > 2 files changed, 105 insertions(+), 56 deletions(-) > > mode change 100644 => 100755 drivers/crypto/hisilicon/qm.c > > ... > > > - qm_mb_write(qm, mailbox); > > +#if IS_ENABLED(CONFIG_ARM64) > > + asm volatile("ldp %0, %1, %3\n" > > + "stp %0, %1, %2\n" > > + "dmb oshst\n" > > + : "=&r" (tmp0), > > + "=&r" (tmp1), > > + "+Q" (*((char *)dst)) > > + : "Q" (*((char __iomem *)fun_base)) > > + : "memory"); > > +#endif > > You should add a generic 128-bite write primitive for arm64 instead > of doing it in raw assembly in the driver. > IIRC there have been other cases (ThunderX?) where 128 bit MMIO accessors were needed for some peripheral, but common arm64 helpers were rejected on the basis that this atomic behavior is not architectural. Obviously, using inline asm in the driver is not the right way either, so perhaps we should consider introducing some common infrastructure anyway, including some expectation management about their guarantees.
On Fri, Aug 18, 2023 at 12:21:02PM +0200, Ard Biesheuvel wrote: > > IIRC there have been other cases (ThunderX?) where 128 bit MMIO > accessors were needed for some peripheral, but common arm64 helpers > were rejected on the basis that this atomic behavior is not > architectural. > > Obviously, using inline asm in the driver is not the right way either, > so perhaps we should consider introducing some common infrastructure > anyway, including some expectation management about their guarantees. The ones in drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h look better. So perhaps copy them into hisilicon? Cheers,
On Sat, 19 Aug 2023 at 06:12, Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Fri, Aug 18, 2023 at 12:21:02PM +0200, Ard Biesheuvel wrote: > > > > IIRC there have been other cases (ThunderX?) where 128 bit MMIO > > accessors were needed for some peripheral, but common arm64 helpers > > were rejected on the basis that this atomic behavior is not > > architectural. > > > > Obviously, using inline asm in the driver is not the right way either, > > so perhaps we should consider introducing some common infrastructure > > anyway, including some expectation management about their guarantees. > > The ones in > > drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h > > look better. So perhaps copy them into hisilicon? > No, that otx2_write128() routine looks buggy, actually, The ! at the end means writeback, and so the register holding addr will be modified, which is not reflect in the asm constraints. It also lacks a barrier. The generic version just ORs the low and high qwords together, so it obviously only exists for compile coverage (and the generic atomic add is clearly broken too)
On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote: > > No, that otx2_write128() routine looks buggy, actually, The ! at the > end means writeback, and so the register holding addr will be > modified, which is not reflect in the asm constraints. It also lacks a > barrier. OK. But at least having a helper called write128 looks a lot cleaner than just having unexplained assembly in the code. Cheers,
On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote: > On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote: > > > > No, that otx2_write128() routine looks buggy, actually, The ! at the > > end means writeback, and so the register holding addr will be > > modified, which is not reflect in the asm constraints. It also lacks a > > barrier. > > OK. But at least having a helper called write128 looks a lot > cleaner than just having unexplained assembly in the code. I guess we want something similar to how writeq() is handled on 32-bit architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. It's then CPU-dependent on whether you get atomicity. Will
On 2023/8/21 18:26, Will Deacon wrote: > On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote: >> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote: >>> >>> No, that otx2_write128() routine looks buggy, actually, The ! at the >>> end means writeback, and so the register holding addr will be >>> modified, which is not reflect in the asm constraints. It also lacks a >>> barrier. >> >> OK. But at least having a helper called write128 looks a lot >> cleaner than just having unexplained assembly in the code. > > I guess we want something similar to how writeq() is handled on 32-bit > architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. > > It's then CPU-dependent on whether you get atomicity. > > Will > . > Thanks for the review. Since the HiSilicon accelerator devices are supported only on the ARM64 platform, the following 128bit read and write interfaces are added to the driver, is this OK? #if defined(CONFIG_ARM64) static void qm_write128(void __iomem *addr, const void *buffer) { unsigned long tmp0 = 0, tmp1 = 0; asm volatile("ldp %0, %1, %3\n" "stp %0, %1, %2\n" "dmb oshst\n" : "=&r" (tmp0), "=&r" (tmp1), "+Q" (*((char __iomem *)addr)) : "Q" (*((char *)buffer)) : "memory"); } static void qm_read128(void *buffer, const void __iomem *addr) { unsigned long tmp0 = 0, tmp1 = 0; asm volatile("ldp %0, %1, %3\n" "stp %0, %1, %2\n" "dmb oshst\n" : "=&r" (tmp0), "=&r" (tmp1), "+Q" (*((char *)buffer)) : "Q" (*((char __iomem *)addr)) : "memory"); } #else static void qm_write128(void __iomem *addr, const void *buffer) { } static void qm_read128(void *buffer, const void __iomem *addr) { } #endif Thanks, Weili
On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote: > > > > On 2023/8/21 18:26, Will Deacon wrote: > > On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote: > >> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote: > >>> > >>> No, that otx2_write128() routine looks buggy, actually, The ! at the > >>> end means writeback, and so the register holding addr will be > >>> modified, which is not reflect in the asm constraints. It also lacks a > >>> barrier. > >> > >> OK. But at least having a helper called write128 looks a lot > >> cleaner than just having unexplained assembly in the code. > > > > I guess we want something similar to how writeq() is handled on 32-bit > > architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. > > > > It's then CPU-dependent on whether you get atomicity. > > > > Will > > . > > > > Thanks for the review. > > Since the HiSilicon accelerator devices are supported only on the ARM64 platform, > the following 128bit read and write interfaces are added to the driver, is this OK? No, this does not belong in the driver. As Will is suggesting, we should define some generic helpers for this, and provide an arm64 implementation if the generic one does not result in the correct codegen. > > #if defined(CONFIG_ARM64) > static void qm_write128(void __iomem *addr, const void *buffer) > { > unsigned long tmp0 = 0, tmp1 = 0; > > asm volatile("ldp %0, %1, %3\n" > "stp %0, %1, %2\n" > "dmb oshst\n" > : "=&r" (tmp0), > "=&r" (tmp1), > "+Q" (*((char __iomem *)addr)) This constraint describes the first byte of addr, which is sloppy but probably works fine. > : "Q" (*((char *)buffer)) This constraint describes the first byte of buffer, which might cause problems because the asm reads the entire buffer not just the first byte. > : "memory"); > } > > static void qm_read128(void *buffer, const void __iomem *addr) > { > unsigned long tmp0 = 0, tmp1 = 0; > > asm volatile("ldp %0, %1, %3\n" > "stp %0, %1, %2\n" > "dmb oshst\n" Is this the right barrier for a read? > : "=&r" (tmp0), > "=&r" (tmp1), > "+Q" (*((char *)buffer)) > : "Q" (*((char __iomem *)addr)) > : "memory"); > } > > #else > static void qm_write128(void __iomem *addr, const void *buffer) > { > > } > > static void qm_read128(void *buffer, const void __iomem *addr) > { > > } > #endif > Have you tried using __uint128_t accesses instead? E.g., something like static void qm_write128(void __iomem *addr, const void *buffer) { volatile __uint128_t *dst = (volatile __uint128_t __force *)addr; const __uint128_t *src __aligned(1) = buffer; WRITE_ONCE(*dst, *src); dma_wmb(); } should produce the right instruction sequence, and works on all architectures that have CONFIG_ARCH_SUPPORTS_INT128=y This needs a big fat comment describing that the MMIO access needs to be atomic, but we could consider it as a fallback if we decide not to bother with special MMIO accessors, although I suspect we'll be needing them more widely at some point anyway.
On Mon, Aug 21, 2023, at 08:45, Weili Qian wrote: > On 2023/8/21 18:26, Will Deacon wrote: >> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote: > Thanks for the review. > > Since the HiSilicon accelerator devices are supported only on the ARM64 > platform, > the following 128bit read and write interfaces are added to the driver, > is this OK? > > #if defined(CONFIG_ARM64) > static void qm_write128(void __iomem *addr, const void *buffer) > { The naming makes it specific to the driver, which is not appropriate for a global definition. Just follow the generic naming. I guess you wouldn't have to do both the readl/writel and the ioread32/iowrite32 variants, so just start with the ioread/iowrite version. That also avoids having to come up with a new letter. You have the arguments in the wrong order compared to iowrite32(), which is very confusing. Instead of the pointer to the buffer, I'd suggest passing the value by value here, to make it behave like the other ones. This does mean it won't build on targets that don't define u128, but I think we can handle that in a Kconfig symbol. > unsigned long tmp0 = 0, tmp1 = 0; Don't initialize local variable to zero, that is generally a bad idea because it hides cases where they are not initialized properly. > asm volatile("ldp %0, %1, %3\n" > "stp %0, %1, %2\n" This is missing the endian-swap for big-endian kernels. > "dmb oshst\n" You have the barrier at the wrong end of the helper, it needs to before the actual store to have the intended effect. Also, you should really use the generic __io_bw() helper rather than open-coding it. > : "=&r" (tmp0), > "=&r" (tmp1), The tmp0/tmp1 registers are technically a clobber, not an in/out, though ideally these should be turned into inputs. > "+Q" (*((char __iomem *)addr)) > : "Q" (*((char *)buffer)) wrong length > : "memory"); > } The memory clobber is usually part of the barrier. > static void qm_read128(void *buffer, const void __iomem *addr) > { > unsigned long tmp0 = 0, tmp1 = 0; > > asm volatile("ldp %0, %1, %3\n" > "stp %0, %1, %2\n" > "dmb oshst\n" > : "=&r" (tmp0), > "=&r" (tmp1), > "+Q" (*((char *)buffer)) > : "Q" (*((char __iomem *)addr)) > : "memory"); > } Same thing, but you are missing the control dependency from __io_ar() here, rather than just open-coding it. > #else > static void qm_write128(void __iomem *addr, const void *buffer) > { > > } This is missing the entire implementation? Arnd
On 2023/8/21 22:36, Ard Biesheuvel wrote: > On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote: >> >> >> >> On 2023/8/21 18:26, Will Deacon wrote: >>> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote: >>>> On Sat, Aug 19, 2023 at 09:33:18AM +0200, Ard Biesheuvel wrote: >>>>> >>>>> No, that otx2_write128() routine looks buggy, actually, The ! at the >>>>> end means writeback, and so the register holding addr will be >>>>> modified, which is not reflect in the asm constraints. It also lacks a >>>>> barrier. >>>> >>>> OK. But at least having a helper called write128 looks a lot >>>> cleaner than just having unexplained assembly in the code. >>> >>> I guess we want something similar to how writeq() is handled on 32-bit >>> architectures (see include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. >>> >>> It's then CPU-dependent on whether you get atomicity. >>> >>> Will >>> . >>> >> >> Thanks for the review. >> >> Since the HiSilicon accelerator devices are supported only on the ARM64 platform, >> the following 128bit read and write interfaces are added to the driver, is this OK? > > No, this does not belong in the driver. As Will is suggesting, we > should define some generic helpers for this, and provide an arm64 > implementation if the generic one does not result in the correct > codegen. > Sorry, I misunderstood here. >> >> #if defined(CONFIG_ARM64) >> static void qm_write128(void __iomem *addr, const void *buffer) >> { >> unsigned long tmp0 = 0, tmp1 = 0; >> >> asm volatile("ldp %0, %1, %3\n" >> "stp %0, %1, %2\n" >> "dmb oshst\n" >> : "=&r" (tmp0), >> "=&r" (tmp1), >> "+Q" (*((char __iomem *)addr)) > > This constraint describes the first byte of addr, which is sloppy but > probably works fine. > >> : "Q" (*((char *)buffer)) > > This constraint describes the first byte of buffer, which might cause > problems because the asm reads the entire buffer not just the first > byte. I don't understand why constraint describes the first byte of buffer,and the compilation result seems to be ok. 1811 1afc: a9400a61 ldp x1, x2, [x19] 1812 1b00: a9000801 stp x1, x2, [x0] 1813 1b04: d50332bf dmb oshst Maybe I'm wrong about 'Q', could you explain it or where can I learn more about this? > >> : "memory"); >> } >> >> static void qm_read128(void *buffer, const void __iomem *addr) >> { >> unsigned long tmp0 = 0, tmp1 = 0; >> >> asm volatile("ldp %0, %1, %3\n" >> "stp %0, %1, %2\n" >> "dmb oshst\n" > > Is this the right barrier for a read? Should be "dmb oshld\n". > >> : "=&r" (tmp0), >> "=&r" (tmp1), >> "+Q" (*((char *)buffer)) >> : "Q" (*((char __iomem *)addr)) >> : "memory"); >> } >> >> #else >> static void qm_write128(void __iomem *addr, const void *buffer) >> { >> >> } >> >> static void qm_read128(void *buffer, const void __iomem *addr) >> { >> >> } >> #endif >> > > Have you tried using __uint128_t accesses instead? > > E.g., something like > > static void qm_write128(void __iomem *addr, const void *buffer) > { > volatile __uint128_t *dst = (volatile __uint128_t __force *)addr; > const __uint128_t *src __aligned(1) = buffer; > > WRITE_ONCE(*dst, *src); > dma_wmb(); > } > > should produce the right instruction sequence, and works on all > architectures that have CONFIG_ARCH_SUPPORTS_INT128=y > I tried this, but WRITE_ONCE does not support type __uint128_t. ->WRITE_ONCE ->compiletime_assert_rwonce_type ->compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \ "Unsupported access size for {READ,WRITE}_ONCE().") So can we define generic IO helpers based on patchset https://lore.kernel.org/all/20180124090519.6680-4-ynorov@caviumnetworks.com/ Part of the implementation is as follows: add writeo() in include/asm-generic/io.h #ifdef CONFIG_ARCH_SUPPORTS_INT128 #ifndef writeo #define writeo writeo static inline void writeo(__uint128_t value, volatile void __iomem *addr) { __io_bw(); __raw_writeo((__uint128_t __force)__cpu_to_le128(value), addr); //__cpu_to_le128 will implement. __io_aw(); } #endif #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ #ifdef CONFIG_ARCH_SUPPORTS_INT128 #ifndef iowrite128 #define iowrite128 iowrite128 static inline void iowrite128(__uint128_t value, volatile void __iomem *addr) { writeo(value, addr); } #endif #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ in arch/arm64/include/asm/io.h #ifdef CONFIG_ARCH_SUPPORTS_INT128 #define __raw_writeo __raw_writeo static __always_inline void __raw_writeo(__uint128_t val, volatile void __iomem *addr) { u64 hi, lo; lo = (u64)val; hi = (u64)(val >> 64); asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(lo), "rZ"(hi), "r"(addr)); } #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ And add io{read|write}128bits in include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. static inline void lo_hi_writeo(__uint128_t val, volatile void __iomem *addr) { writeq(val, addr); writeq(val >> 64, addr); } #ifndef writeo #define writeo lo_hi_writeo #endif static inline void hi_lo_writeo(__uint128_t val, volatile void __iomem *addr) { writeq(val >> 64, addr); writeq(val, addr); } #ifndef writeo #define writeo hi_lo_writeo #endif If this is ok, I'm going to implement reado and writeo, then submit the patchset. Thanks, Weili > This needs a big fat comment describing that the MMIO access needs to > be atomic, but we could consider it as a fallback if we decide not to > bother with special MMIO accessors, although I suspect we'll be > needing them more widely at some point anyway. > . >
On 2023/8/22 4:47, Arnd Bergmann wrote: > On Mon, Aug 21, 2023, at 08:45, Weili Qian wrote: >> On 2023/8/21 18:26, Will Deacon wrote: >>> On Mon, Aug 21, 2023 at 04:19:54PM +0800, Herbert Xu wrote: > >> Thanks for the review. >> >> Since the HiSilicon accelerator devices are supported only on the ARM64 >> platform, >> the following 128bit read and write interfaces are added to the driver, >> is this OK? >> >> #if defined(CONFIG_ARM64) >> static void qm_write128(void __iomem *addr, const void *buffer) >> { > > The naming makes it specific to the driver, which is not > appropriate for a global definition. Just follow the > generic naming. I guess you wouldn't have to do both > the readl/writel and the ioread32/iowrite32 variants, so > just start with the ioread/iowrite version. That also > avoids having to come up with a new letter. > > You have the arguments in the wrong order compared to iowrite32(), > which is very confusing. > > Instead of the pointer to the buffer, I'd suggest passing the > value by value here, to make it behave like the other ones. > > This does mean it won't build on targets that don't > define u128, but I think we can handle that in a Kconfig > symbol. Ok, I will add generic IO helpers ioread128/iowrite128 for this, keep it consistent with ioread32/iowrite32, submit patchset later. And remove them from the driver. > >> unsigned long tmp0 = 0, tmp1 = 0; > > Don't initialize local variable to zero, that is generally > a bad idea because it hides cases where they are not > initialized properly. > >> asm volatile("ldp %0, %1, %3\n" >> "stp %0, %1, %2\n" > > This is missing the endian-swap for big-endian kernels. The input parameter data has been endian-swap. > >> "dmb oshst\n" > > You have the barrier at the wrong end of the helper, it > needs to before the actual store to have the intended > effect. > > Also, you should really use the generic __io_bw() helper > rather than open-coding it. OK. > >> : "=&r" (tmp0), >> "=&r" (tmp1), > > The tmp0/tmp1 registers are technically a clobber, not > an in/out, though ideally these should be turned > into inputs. > >> "+Q" (*((char __iomem *)addr)) >> : "Q" (*((char *)buffer)) > > wrong length > >> : "memory"); >> } > > The memory clobber is usually part of the barrier. Yeah, the memory can be removed. > >> static void qm_read128(void *buffer, const void __iomem *addr) >> { >> unsigned long tmp0 = 0, tmp1 = 0; >> >> asm volatile("ldp %0, %1, %3\n" >> "stp %0, %1, %2\n" >> "dmb oshst\n" >> : "=&r" (tmp0), >> "=&r" (tmp1), >> "+Q" (*((char *)buffer)) >> : "Q" (*((char __iomem *)addr)) >> : "memory"); >> } > > Same thing, but you are missing the control dependency > from __io_ar() here, rather than just open-coding it. > >> #else >> static void qm_write128(void __iomem *addr, const void *buffer) >> { >> >> } > > This is missing the entire implementation? If the interface is implemented in the driver, the driver runs only on the ARM64 platform. Therefore, there is no need to implement. > > Arnd > . > Thanks, Weili
On Thu, Aug 24, 2023, at 23:07, Weili Qian wrote: > On 2023/8/21 22:36, Ard Biesheuvel wrote: >> On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote: >>> : "Q" (*((char *)buffer)) >> >> This constraint describes the first byte of buffer, which might cause >> problems because the asm reads the entire buffer not just the first >> byte. > I don't understand why constraint describes the first byte of > buffer,and the compilation result seems to be ok. > > 1811 1afc: a9400a61 ldp x1, x2, [x19] > 1812 1b00: a9000801 stp x1, x2, [x0] > 1813 1b04: d50332bf dmb oshst > Maybe I'm wrong about 'Q', could you explain it or where can I learn > more about this? The "Q" is not the problem here, it's the cast to (char *), which tells the compiler that only the first byte is used here, and allows it to not actually store the rest of the buffer into memory. It's not a problem on the __iomem pointer side, since gcc never accesses those directly, and for the version taking a __u128 literal or two __u64 registers it is also ok. >>> unsigned long tmp0 = 0, tmp1 = 0; >>> >>> asm volatile("ldp %0, %1, %3\n" >>> "stp %0, %1, %2\n" >>> "dmb oshst\n" >> >> Is this the right barrier for a read? > Should be "dmb oshld\n". As I said, this needs to be __io_ar(), which might be defined in a different way. >> >> Have you tried using __uint128_t accesses instead? >> >> E.g., something like >> >> static void qm_write128(void __iomem *addr, const void *buffer) >> { >> volatile __uint128_t *dst = (volatile __uint128_t __force *)addr; >> const __uint128_t *src __aligned(1) = buffer; >> >> WRITE_ONCE(*dst, *src); >> dma_wmb(); >> } >> >> should produce the right instruction sequence, and works on all >> architectures that have CONFIG_ARCH_SUPPORTS_INT128=y >> > > I tried this, but WRITE_ONCE does not support type __uint128_t. > ->WRITE_ONCE > ->compiletime_assert_rwonce_type > ->compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \ > "Unsupported access size for {READ,WRITE}_ONCE().") On top of that, WRITE_ONCE() does not guarantee atomicity, and dma_wmb() might not be the correct barrier. > So can we define generic IO helpers based on patchset > https://lore.kernel.org/all/20180124090519.6680-4-ynorov@caviumnetworks.com/ > Part of the implementation is as follows: > > add writeo() in include/asm-generic/io.h > > #ifdef CONFIG_ARCH_SUPPORTS_INT128 > #ifndef writeo > #define writeo writeo > static inline void writeo(__uint128_t value, volatile void __iomem > *addr) > { > __io_bw(); > __raw_writeo((__uint128_t __force)__cpu_to_le128(value), addr); > //__cpu_to_le128 will implement. > __io_aw(); > } > #endif > #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ Right, this is fairly close to what we need. The 'o' notation might be slightly controversial, which is why I suggested definining only iowrite128() to avoid having to agree on the correct letter for 16-byte stores. > in arch/arm64/include/asm/io.h > > #ifdef CONFIG_ARCH_SUPPORTS_INT128 > #define __raw_writeo __raw_writeo > static __always_inline void __raw_writeo(__uint128_t val, volatile > void __iomem *addr) > { > u64 hi, lo; > > lo = (u64)val; > hi = (u64)(val >> 64); > > asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(lo), "rZ"(hi), "r"(addr)); > } > #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ This definition looks fine. > And add io{read|write}128bits in include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. > static inline void lo_hi_writeo(__uint128_t val, volatile void __iomem *addr) > { > writeq(val, addr); > writeq(val >> 64, addr); > } This also looks fine. Arnd
On 2023/8/30 3:37, Arnd Bergmann wrote: > On Thu, Aug 24, 2023, at 23:07, Weili Qian wrote: >> On 2023/8/21 22:36, Ard Biesheuvel wrote: >>> On Mon, 21 Aug 2023 at 14:45, Weili Qian <qianweili@huawei.com> wrote: > >>>> : "Q" (*((char *)buffer)) >>> >>> This constraint describes the first byte of buffer, which might cause >>> problems because the asm reads the entire buffer not just the first >>> byte. >> I don't understand why constraint describes the first byte of >> buffer,and the compilation result seems to be ok. >> >> 1811 1afc: a9400a61 ldp x1, x2, [x19] >> 1812 1b00: a9000801 stp x1, x2, [x0] >> 1813 1b04: d50332bf dmb oshst >> Maybe I'm wrong about 'Q', could you explain it or where can I learn >> more about this? > > The "Q" is not the problem here, it's the cast to (char *), which > tells the compiler that only the first byte is used here, and > allows it to not actually store the rest of the buffer into > memory. > > It's not a problem on the __iomem pointer side, since gcc never > accesses those directly, and for the version taking a __u128 literal > or two __u64 registers it is also ok. Thanks for your reply, I have got it. > >>>> unsigned long tmp0 = 0, tmp1 = 0; >>>> >>>> asm volatile("ldp %0, %1, %3\n" >>>> "stp %0, %1, %2\n" >>>> "dmb oshst\n" >>> >>> Is this the right barrier for a read? >> Should be "dmb oshld\n". > > As I said, this needs to be __io_ar(), which might be > defined in a different way. > >>> >>> Have you tried using __uint128_t accesses instead? >>> >>> E.g., something like >>> >>> static void qm_write128(void __iomem *addr, const void *buffer) >>> { >>> volatile __uint128_t *dst = (volatile __uint128_t __force *)addr; >>> const __uint128_t *src __aligned(1) = buffer; >>> >>> WRITE_ONCE(*dst, *src); >>> dma_wmb(); >>> } >>> >>> should produce the right instruction sequence, and works on all >>> architectures that have CONFIG_ARCH_SUPPORTS_INT128=y >>> >> >> I tried this, but WRITE_ONCE does not support type __uint128_t. >> ->WRITE_ONCE >> ->compiletime_assert_rwonce_type >> ->compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \ >> "Unsupported access size for {READ,WRITE}_ONCE().") > > On top of that, WRITE_ONCE() does not guarantee atomicity, and > dma_wmb() might not be the correct barrier. > >> So can we define generic IO helpers based on patchset >> https://lore.kernel.org/all/20180124090519.6680-4-ynorov@caviumnetworks.com/ >> Part of the implementation is as follows: >> >> add writeo() in include/asm-generic/io.h >> >> #ifdef CONFIG_ARCH_SUPPORTS_INT128 >> #ifndef writeo >> #define writeo writeo >> static inline void writeo(__uint128_t value, volatile void __iomem >> *addr) >> { >> __io_bw(); >> __raw_writeo((__uint128_t __force)__cpu_to_le128(value), addr); >> //__cpu_to_le128 will implement. >> __io_aw(); >> } >> #endif >> #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ > > Right, this is fairly close to what we need. The 'o' notation > might be slightly controversial, which is why I suggested > definining only iowrite128() to avoid having to agree on > the correct letter for 16-byte stores. Okay, I'll just define iowrite128() and ioread128(). Thanks, Weili > >> in arch/arm64/include/asm/io.h >> >> #ifdef CONFIG_ARCH_SUPPORTS_INT128 >> #define __raw_writeo __raw_writeo >> static __always_inline void __raw_writeo(__uint128_t val, volatile >> void __iomem *addr) >> { >> u64 hi, lo; >> >> lo = (u64)val; >> hi = (u64)(val >> 64); >> >> asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(lo), "rZ"(hi), "r"(addr)); >> } >> #endif /* CONFIG_ARCH_SUPPORTS_INT128 */ > > This definition looks fine. > >> And add io{read|write}128bits in include/linux/io-64-nonatomic-{hi-lo,lo-hi}.h. >> static inline void lo_hi_writeo(__uint128_t val, volatile void __iomem *addr) >> { >> writeq(val, addr); >> writeq(val >> 64, addr); >> } > > This also looks fine. > > Arnd > . >
diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c old mode 100644 new mode 100755 index a99fd589445c..13cd2617170e --- a/drivers/crypto/hisilicon/qm.c +++ b/drivers/crypto/hisilicon/qm.c @@ -33,6 +33,10 @@ #define QM_MB_CMD_DATA_SHIFT 32 #define QM_MB_CMD_DATA_MASK GENMASK(31, 0) #define QM_MB_STATUS_MASK GENMASK(12, 9) +#define QM_MB_BUSY_MASK BIT(13) +#define QM_MB_SIZE 16 +#define QM_MB_MAX_WAIT_CNT 6000 +#define QM_MB_WAIT_READY_CNT 10 /* sqc shift */ #define QM_SQ_HOP_NUM_SHIFT 0 @@ -597,17 +601,6 @@ static void qm_mb_pre_init(struct qm_mailbox *mailbox, u8 cmd, mailbox->rsvd = 0; } -/* return 0 mailbox ready, -ETIMEDOUT hardware timeout */ -int hisi_qm_wait_mb_ready(struct hisi_qm *qm) -{ - u32 val; - - return readl_relaxed_poll_timeout(qm->io_base + QM_MB_CMD_SEND_BASE, - val, !((val >> QM_MB_BUSY_SHIFT) & - 0x1), POLL_PERIOD, POLL_TIMEOUT); -} -EXPORT_SYMBOL_GPL(hisi_qm_wait_mb_ready); - /* 128 bit should be written to hardware at one time to trigger a mailbox */ static void qm_mb_write(struct hisi_qm *qm, const void *src) { @@ -618,7 +611,7 @@ static void qm_mb_write(struct hisi_qm *qm, const void *src) #endif if (!IS_ENABLED(CONFIG_ARM64)) { - memcpy_toio(fun_base, src, 16); + memcpy_toio(fun_base, src, QM_MB_SIZE); dma_wmb(); return; } @@ -635,35 +628,95 @@ static void qm_mb_write(struct hisi_qm *qm, const void *src) #endif } -static int qm_mb_nolock(struct hisi_qm *qm, struct qm_mailbox *mailbox) +/* 128 bit should be read from hardware at one time */ +static void qm_mb_read(struct hisi_qm *qm, void *dst) { - int ret; - u32 val; + const void __iomem *fun_base = qm->io_base + QM_MB_CMD_SEND_BASE; + +#if IS_ENABLED(CONFIG_ARM64) + unsigned long tmp0 = 0, tmp1 = 0; +#endif - if (unlikely(hisi_qm_wait_mb_ready(qm))) { - dev_err(&qm->pdev->dev, "QM mailbox is busy to start!\n"); - ret = -EBUSY; - goto mb_busy; + if (!IS_ENABLED(CONFIG_ARM64)) { + memcpy_fromio(dst, fun_base, QM_MB_SIZE); + dma_wmb(); + return; } - qm_mb_write(qm, mailbox); +#if IS_ENABLED(CONFIG_ARM64) + asm volatile("ldp %0, %1, %3\n" + "stp %0, %1, %2\n" + "dmb oshst\n" + : "=&r" (tmp0), + "=&r" (tmp1), + "+Q" (*((char *)dst)) + : "Q" (*((char __iomem *)fun_base)) + : "memory"); +#endif +} + +int hisi_qm_wait_mb_ready(struct hisi_qm *qm) +{ + struct qm_mailbox mailbox; + int i = 0; + + while (i++ < QM_MB_WAIT_READY_CNT) { + qm_mb_read(qm, &mailbox); + if (!(le16_to_cpu(mailbox.w0) & QM_MB_BUSY_MASK)) + return 0; - if (unlikely(hisi_qm_wait_mb_ready(qm))) { - dev_err(&qm->pdev->dev, "QM mailbox operation timeout!\n"); - ret = -ETIMEDOUT; - goto mb_busy; + usleep_range(WAIT_PERIOD_US_MIN, WAIT_PERIOD_US_MAX); } - val = readl(qm->io_base + QM_MB_CMD_SEND_BASE); - if (val & QM_MB_STATUS_MASK) { - dev_err(&qm->pdev->dev, "QM mailbox operation failed!\n"); - ret = -EIO; - goto mb_busy; + dev_err(&qm->pdev->dev, "QM mailbox is busy to start!\n"); + + return -EBUSY; +} +EXPORT_SYMBOL_GPL(hisi_qm_wait_mb_ready); + +static int qm_wait_mb_finish(struct hisi_qm *qm, struct qm_mailbox *mailbox) +{ + struct device *dev = &qm->pdev->dev; + int i = 0; + + while (++i) { + qm_mb_read(qm, mailbox); + if (!(le16_to_cpu(mailbox->w0) & QM_MB_BUSY_MASK)) + break; + + if (i == QM_MB_MAX_WAIT_CNT) { + dev_err(dev, "QM mailbox operation timeout!\n"); + return -ETIMEDOUT; + } + + usleep_range(WAIT_PERIOD_US_MIN, WAIT_PERIOD_US_MAX); + } + + if (le16_to_cpu(mailbox->w0) & QM_MB_STATUS_MASK) { + dev_err(dev, "QM mailbox operation failed!\n"); + return -EIO; } return 0; +} + +static int qm_mb_nolock(struct hisi_qm *qm, struct qm_mailbox *mailbox) +{ + int ret; -mb_busy: + ret = hisi_qm_wait_mb_ready(qm); + if (ret) + goto mb_err_cnt_increase; + + qm_mb_write(qm, mailbox); + + ret = qm_wait_mb_finish(qm, mailbox); + if (ret) + goto mb_err_cnt_increase; + + return 0; + +mb_err_cnt_increase: atomic64_inc(&qm->debug.dfx.mb_err_cnt); return ret; } @@ -687,6 +740,24 @@ int hisi_qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue, } EXPORT_SYMBOL_GPL(hisi_qm_mb); +static int hisi_qm_mb_read(struct hisi_qm *qm, u64 *base, u8 cmd, u16 queue) +{ + struct qm_mailbox mailbox; + int ret; + + qm_mb_pre_init(&mailbox, cmd, 0, queue, 1); + mutex_lock(&qm->mailbox_lock); + ret = qm_mb_nolock(qm, &mailbox); + mutex_unlock(&qm->mailbox_lock); + if (ret) + return ret; + + *base = le32_to_cpu(mailbox.base_l) | + ((u64)le32_to_cpu(mailbox.base_h) << 32); + + return 0; +} + static void qm_db_v1(struct hisi_qm *qm, u16 qn, u8 cmd, u16 index, u8 priority) { u64 doorbell; @@ -1308,12 +1379,10 @@ static int qm_get_vft_v2(struct hisi_qm *qm, u32 *base, u32 *number) u64 sqc_vft; int ret; - ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1); + ret = hisi_qm_mb_read(qm, &sqc_vft, QM_MB_CMD_SQC_VFT_V2, 0); if (ret) return ret; - sqc_vft = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) | - ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << 32); *base = QM_SQC_VFT_BASE_MASK_V2 & (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2); *number = (QM_SQC_VFT_NUM_MASK_V2 & (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1; @@ -1484,25 +1553,6 @@ static enum acc_err_result qm_hw_error_handle_v2(struct hisi_qm *qm) return ACC_ERR_RECOVERED; } -static int qm_get_mb_cmd(struct hisi_qm *qm, u64 *msg, u16 fun_num) -{ - struct qm_mailbox mailbox; - int ret; - - qm_mb_pre_init(&mailbox, QM_MB_CMD_DST, 0, fun_num, 0); - mutex_lock(&qm->mailbox_lock); - ret = qm_mb_nolock(qm, &mailbox); - if (ret) - goto err_unlock; - - *msg = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) | - ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << 32); - -err_unlock: - mutex_unlock(&qm->mailbox_lock); - return ret; -} - static void qm_clear_cmd_interrupt(struct hisi_qm *qm, u64 vf_mask) { u32 val; @@ -1522,7 +1572,7 @@ static void qm_handle_vf_msg(struct hisi_qm *qm, u32 vf_id) u64 msg; int ret; - ret = qm_get_mb_cmd(qm, &msg, vf_id); + ret = hisi_qm_mb_read(qm, &msg, QM_MB_CMD_DST, vf_id); if (ret) { dev_err(dev, "failed to get msg from VF(%u)!\n", vf_id); return; @@ -4755,7 +4805,7 @@ static int qm_wait_pf_reset_finish(struct hisi_qm *qm) * Whether message is got successfully, * VF needs to ack PF by clearing the interrupt. */ - ret = qm_get_mb_cmd(qm, &msg, 0); + ret = hisi_qm_mb_read(qm, &msg, QM_MB_CMD_DST, 0); qm_clear_cmd_interrupt(qm, 0); if (ret) { dev_err(dev, "failed to get msg from PF in reset done!\n"); @@ -4809,7 +4859,7 @@ static void qm_handle_cmd_msg(struct hisi_qm *qm, u32 fun_num) * Get the msg from source by sending mailbox. Whether message is got * successfully, destination needs to ack source by clearing the interrupt. */ - ret = qm_get_mb_cmd(qm, &msg, fun_num); + ret = hisi_qm_mb_read(qm, &msg, QM_MB_CMD_DST, fun_num); qm_clear_cmd_interrupt(qm, BIT(fun_num)); if (ret) { dev_err(dev, "failed to get msg from source!\n"); diff --git a/include/linux/hisi_acc_qm.h b/include/linux/hisi_acc_qm.h index 39fbfb4be944..0f83c19a8f36 100644 --- a/include/linux/hisi_acc_qm.h +++ b/include/linux/hisi_acc_qm.h @@ -52,7 +52,6 @@ #define QM_MB_OP_SHIFT 14 #define QM_MB_CMD_DATA_ADDR_L 0x304 #define QM_MB_CMD_DATA_ADDR_H 0x308 -#define QM_MB_MAX_WAIT_CNT 6000 /* doorbell */ #define QM_DOORBELL_CMD_SQ 0
The malibox needs to be triggered by a 128bit atomic operation. The reason is that one QM hardware entity in one accelerator servers QM mailbox MMIO interfaces in related PF and VFs. A mutex cannot lock mailbox processes in different functions. When multiple functions access the mailbox simultaneously, if the generic IO interface readq/writeq is used to access the mailbox, the data read from mailbox or written to mailbox is unpredictable. Therefore, the generic IO interface is changed to a 128bit atomic operation. Signed-off-by: Weili Qian <qianweili@huawei.com> --- drivers/crypto/hisilicon/qm.c | 160 ++++++++++++++++++++++------------ include/linux/hisi_acc_qm.h | 1 - 2 files changed, 105 insertions(+), 56 deletions(-) mode change 100644 => 100755 drivers/crypto/hisilicon/qm.c