Message ID | 20180119145303.4097-1-linus.walleij@linaro.org |
---|---|
State | Superseded |
Headers | show |
Series | [v6] RFT: mmc: sdhci: Implement an SDHCI-specific bounce buffer | expand |
Good morning, I tested it a few moments ago and to make it short: it doesn't work. Here is the long story: sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman sdhci-pltfm: SDHCI platform and OF driver helper mmc0 bounce up to 128 segments into one, max segment size 65536 bytes mmc0: SDHCI controller on 53fb4000.esdhc [53fb4000.esdhc] using DMA sdhci-esdhc-imx 53fb8000.esdhc: Got CD GPIO sdhci-esdhc-imx 53fb8000.esdhc: Got WP GPIO mmc1 bounce up to 128 segments into one, max segment size 65536 bytes mmc1: SDHCI controller on 53fb8000.esdhc [53fb8000.esdhc] using DMA oprofile: no performance counters oprofile: using timer interrupt. mmc0: new high speed MMC card at address 0001 NET: Registered protocol family 10 mmcblk0: mmc0:0001 002G00 1.83 GiB mmcblk0boot0: mmc0:0001 002G00 partition 1 512 KiB Segment Routing with IPv6 sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver mmcblk0boot1: mmc0:0001 002G00 partition 2 512 KiB mmcblk0rpmb: mmc0:0001 002G00 partition 3 128 KiB mmcblk0: p1 p2 p3 p4 < p5 p6 p7 p8 p9 p10 > NET: Registered protocol family 17 can: controller area network core (rev 20170425 abi 9) NET: Registered protocol family 29 can: raw protocol (rev 20170425) can: broadcast manager protocol (rev 20170425 t) can: netlink gateway (rev 20170425) max_hops=1 input: gpio-keys as /devices/platform/gpio-keys/input/input1 imxdi_rtc 53ffc000.dryice: setting system clock to 1970-01-12 21:01:47 UTC (1026107) EXT4-fs (mmcblk0p8): couldn't mount as ext3 due to feature incompatibilities EXT4-fs (mmcblk0p8): couldn't mount as ext2 due to feature incompatibilities EXT4-fs (mmcblk0p8): ext4_check_descriptors: Checksum for group 4 failed (25648!=0) EXT4-fs (mmcblk0p8): ext4_check_descriptors: Block bitmap for group 5 not in group (block 0)! EXT4-fs (mmcblk0p8): group descriptors corrupted! VFS: Cannot open root device "mmcblk0p8" or unknown-block(259,0): error -117 Please append a correct "root=" boot option; here are the available partitions: b300 1916928 mmcblk0 driver: mmcblk b301 767 mmcblk0p1 00000000-01 b302 128 mmcblk0p2 00000000-02 b303 128 mmcblk0p3 00000000-03 b304 1 mmcblk0p4 b305 16384 mmcblk0p5 00000000-05 b306 409599 mmcblk0p6 00000000-06 b307 16383 mmcblk0p7 00000000-07 103:00000 409599 mmcblk0p8 00000000-08 103:00001 102399 mmcblk0p9 00000000-09 103:00002 961535 mmcblk0p10 00000000-0a b318 128 mmcblk0rpmb (driver?) b310 512 mmcblk0boot1 (driver?) b308 512 mmcblk0boot0 (driver?) Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(259,0) ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(259,0) Kind regards, Benjamin Beckmeyer
On Mon, Jan 22, 2018 at 7:18 AM, Benjamin Beckmeyer <beckmeyer.b@rittal.de> wrote: > I tested it a few moments ago and to make it short: it doesn't work. Annoying ... since the i.MX and x86 PCI version so clearly differs here I'm left to trial and error over the mailing list. > mmc1 bounce up to 128 segments into one, max segment size 65536 bytes > mmc1: SDHCI controller on 53fb8000.esdhc [53fb8000.esdhc] using DMA Could it be that this SDMA has this problem of not being able to send things aligned on even 64K pages? (Off-by-one error in the hardware.) Could you try this on top of the patch? diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index 25b5e598ee6d..6b436f0e0925 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -3323,7 +3323,7 @@ static int sdhci_allocate_bounce_buffer(struct sdhci_host *host) * has diminishing returns, this is probably because SD/MMC * cards are usually optimized to handle this size of requests. */ - bounce_size = SZ_64K; + bounce_size = SZ_64K-1; /* * Adjust downwards to maximum request size if this is less * than our segment size, else hammer down the maximum The code should report that it bounces 127 segments into 1 instead. Yours, Linus Walleij
Hey Linus, > Annoying ... since the i.MX and x86 PCI version so clearly differs > here I'm left to trial and error over the mailing list. I know, can I do something that it makes it easier for you? > Could you try this on top of the patch? Built in and here is the output. I think, it's what you want, but it doesn't work either. sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman sdhci-pltfm: SDHCI platform and OF driver helper mmc0 bounce up to 127 segments into one, max segment size 65535 bytes mmc0: SDHCI controller on 53fb4000.esdhc [53fb4000.esdhc] using DMA sdhci-esdhc-imx 53fb8000.esdhc: Got CD GPIO sdhci-esdhc-imx 53fb8000.esdhc: Got WP GPIO mmc1 bounce up to 127 segments into one, max segment size 65535 bytes mmc1: SDHCI controller on 53fb8000.esdhc [53fb8000.esdhc] using DMA oprofile: no performance counters oprofile: using timer interrupt. mmc0: new high speed MMC card at address 0001 NET: Registered protocol family 10 mmcblk0: mmc0:0001 002G00 1.83 GiB Segment Routing with IPv6 sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver mmcblk0boot0: mmc0:0001 002G00 partition 1 512 KiB mmcblk0boot1: mmc0:0001 002G00 partition 2 512 KiB NET: Registered protocol family 17 can: controller area network core (rev 20170425 abi 9) NET: Registered protocol family 29 can: raw protocol (rev 20170425) can: broadcast manager protocol (rev 20170425 t) can: netlink gateway (rev 20170425) max_hops=1 input: gpio-keys as /devices/platform/gpio-keys/input/input1 mmcblk0rpmb: mmc0:0001 002G00 partition 3 128 KiB imxdi_rtc 53ffc000.dryice: setting system clock to 1970-01-12 23:15:39 UTC (1034139) mmcblk0: p1 p2 p3 p4 < p5 p6 p7 p8 p9 p10 > EXT4-fs (mmcblk0p8): couldn't mount as ext3 due to feature incompatibilities EXT4-fs (mmcblk0p8): couldn't mount as ext2 due to feature incompatibilities EXT4-fs (mmcblk0p8): ext4_check_descriptors: Checksum for group 3 failed (1874!=0) EXT4-fs (mmcblk0p8): ext4_check_descriptors: Block bitmap for group 4 not in group (block 0)! EXT4-fs (mmcblk0p8): group descriptors corrupted! VFS: Cannot open root device "mmcblk0p8" or unknown-block(259,0): error -117 Please append a correct "root=" boot option; here are the available partitions: b300 1916928 mmcblk0 driver: mmcblk b301 767 mmcblk0p1 00000000-01 b302 128 mmcblk0p2 00000000-02 b303 128 mmcblk0p3 00000000-03 b304 1 mmcblk0p4 b305 16384 mmcblk0p5 00000000-05 b306 409599 mmcblk0p6 00000000-06 b307 16383 mmcblk0p7 00000000-07 103:00000 409599 mmcblk0p8 00000000-08 103:00001 102399 mmcblk0p9 00000000-09 103:00002 961535 mmcblk0p10 00000000-0a b318 128 mmcblk0rpmb (driver?) b310 512 mmcblk0boot1 (driver?) b308 512 mmcblk0boot0 (driver?) Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(259,0) ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(259,0) Kind regards, Benjamin Beckmeyer
On Mon, Jan 22, 2018 at 9:33 AM, Benjamin Beckmeyer <beckmeyer.b@rittal.de> wrote: > Hey Linus, > >> Annoying ... since the i.MX and x86 PCI version so clearly differs >> here I'm left to trial and error over the mailing list. > > I know, can I do something that it makes it easier for you? It's just how it is. Let's use mails and brains for debugging :D >> Could you try this on top of the patch? > > Built in and here is the output. I think, it's what you want, > but it doesn't work either. OK that doesn't work. Scratch that patch. What about this instead? diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index 25b5e598ee6d..416e94dead4f 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -523,7 +523,7 @@ static int sdhci_pre_dma_transfer(struct sdhci_host *host, dma_sync_single_for_device(host->mmc->parent, host->bounce_addr, host->bounce_buffer_size, - DMA_TO_DEVICE); + mmc_get_dma_dir(data)); /* Just a dummy value */ sg_count = 1; } else { @@ -2422,6 +2422,12 @@ static bool sdhci_request_done(struct sdhci_host *host) data->sg_len, host->bounce_buffer, length); + } else { + dma_sync_single_for_cpu( + host->mmc->parent, + host->bounce_addr, + host->bounce_buffer_size, + mmc_get_dma_dir(data)); } } else { /* Unmap the raw data */ Yours, Linus Walleij
> OK that doesn't work. Scratch that patch. What about this instead?
That is working.
I did some quick testing with time and dd again.
#time dd if=/dev/zero of=test bs=1M count=4
real 0m1.898s
user 0m0.002s
sys 0m0.392s
#time dd if=/dev/zero of=test bs=1M count=8
real 0m5.292s
user 0m0.004s
sys 0m0.772s
#time dd if=/dev/zero of=test bs=1M count=16
real 0m10.130s
user 0m0.009s
sys 0m1.479s
#time dd if=/dev/zero of=test bs=1M count=32
real 0m18.752s
user 0m0.001s
sys 0m2.934s
Looks good to me. So far.
Kind regards,
Benjamin Beckmeyer
On Mon, Jan 22, 2018 at 10:59 AM, Benjamin Beckmeyer <beckmeyer.b@rittal.de> wrote: >> OK that doesn't work. Scratch that patch. What about this instead? > > That is working. OK so it was just my random abuse of the DMA API. Sorry for my shitty code that of course works fine on the DMA-coherent x86... > I did some quick testing with time and dd again. OK :) I'll fold this patch into the original, tag with your Tested-by and send out as non-RFC. Yours, Linus Walleij
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index e9290a3439d5..26b93d72f56d 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -21,6 +21,7 @@ #include <linux/dma-mapping.h> #include <linux/slab.h> #include <linux/scatterlist.h> +#include <linux/sizes.h> #include <linux/swiotlb.h> #include <linux/regulator/consumer.h> #include <linux/pm_runtime.h> @@ -502,8 +503,35 @@ static int sdhci_pre_dma_transfer(struct sdhci_host *host, if (data->host_cookie == COOKIE_PRE_MAPPED) return data->sg_count; - sg_count = dma_map_sg(mmc_dev(host->mmc), data->sg, data->sg_len, - mmc_get_dma_dir(data)); + /* Bounce write requests to the bounce buffer */ + if (host->bounce_buffer) { + unsigned int length = data->blksz * data->blocks; + + if (length > host->bounce_buffer_size) { + pr_err("%s: asked for transfer of %u bytes exceeds bounce buffer %u bytes\n", + mmc_hostname(host->mmc), length, + host->bounce_buffer_size); + return -EIO; + } + if (mmc_get_dma_dir(data) == DMA_TO_DEVICE) { + /* Copy the data to the bounce buffer */ + sg_copy_to_buffer(data->sg, data->sg_len, + host->bounce_buffer, + length); + } + /* Switch ownership to the DMA */ + dma_sync_single_for_device(host->mmc->parent, + host->bounce_addr, + host->bounce_buffer_size, + DMA_TO_DEVICE); + /* Just a dummy value */ + sg_count = 1; + } else { + /* Just access the data directly from memory */ + sg_count = dma_map_sg(mmc_dev(host->mmc), + data->sg, data->sg_len, + mmc_get_dma_dir(data)); + } if (sg_count == 0) return -ENOSPC; @@ -858,8 +886,13 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_command *cmd) SDHCI_ADMA_ADDRESS_HI); } else { WARN_ON(sg_cnt != 1); - sdhci_writel(host, sg_dma_address(data->sg), - SDHCI_DMA_ADDRESS); + /* Bounce buffer goes to work */ + if (host->bounce_buffer) + sdhci_writel(host, host->bounce_addr, + SDHCI_DMA_ADDRESS); + else + sdhci_writel(host, sg_dma_address(data->sg), + SDHCI_DMA_ADDRESS); } } @@ -2248,7 +2281,12 @@ static void sdhci_pre_req(struct mmc_host *mmc, struct mmc_request *mrq) mrq->data->host_cookie = COOKIE_UNMAPPED; - if (host->flags & SDHCI_REQ_USE_DMA) + /* + * No pre-mapping in the pre hook if we're using the bounce buffer, + * for that we would need two bounce buffers since one buffer is + * in flight when this is getting called. + */ + if (host->flags & SDHCI_REQ_USE_DMA && !host->bounce_buffer) sdhci_pre_dma_transfer(host, mrq->data, COOKIE_PRE_MAPPED); } @@ -2352,8 +2390,38 @@ static bool sdhci_request_done(struct sdhci_host *host) struct mmc_data *data = mrq->data; if (data && data->host_cookie == COOKIE_MAPPED) { - dma_unmap_sg(mmc_dev(host->mmc), data->sg, data->sg_len, - mmc_get_dma_dir(data)); + if (host->bounce_buffer) { + /* + * On reads, copy the bounced data into the + * sglist + */ + if (mmc_get_dma_dir(data) == DMA_FROM_DEVICE) { + unsigned int length = data->bytes_xfered; + + if (length > host->bounce_buffer_size) { + pr_err("%s: bounce buffer is %u bytes but DMA claims to have transferred %u bytes\n", + mmc_hostname(host->mmc), + host->bounce_buffer_size, + data->bytes_xfered); + /* Cap it down and continue */ + length = host->bounce_buffer_size; + } + dma_sync_single_for_cpu( + host->mmc->parent, + host->bounce_addr, + host->bounce_buffer_size, + DMA_FROM_DEVICE); + sg_copy_from_buffer(data->sg, + data->sg_len, + host->bounce_buffer, + length); + } + } else { + /* Unmap the raw data */ + dma_unmap_sg(mmc_dev(host->mmc), data->sg, + data->sg_len, + mmc_get_dma_dir(data)); + } data->host_cookie = COOKIE_UNMAPPED; } } @@ -2543,6 +2611,14 @@ static void sdhci_adma_show_error(struct sdhci_host *host) } } +static u32 sdhci_sdma_address(struct sdhci_host *host) +{ + if (host->bounce_buffer) + return host->bounce_addr; + else + return sg_dma_address(host->data->sg); +} + static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) { u32 command; @@ -2636,7 +2712,8 @@ static void sdhci_data_irq(struct sdhci_host *host, u32 intmask) */ if (intmask & SDHCI_INT_DMA_END) { u32 dmastart, dmanow; - dmastart = sg_dma_address(host->data->sg); + + dmastart = sdhci_sdma_address(host); dmanow = dmastart + host->data->bytes_xfered; /* * Force update to the next DMA block boundary. @@ -3217,6 +3294,68 @@ void __sdhci_read_caps(struct sdhci_host *host, u16 *ver, u32 *caps, u32 *caps1) } EXPORT_SYMBOL_GPL(__sdhci_read_caps); +static int sdhci_allocate_bounce_buffer(struct sdhci_host *host) +{ + struct mmc_host *mmc = host->mmc; + unsigned int max_blocks; + unsigned int bounce_size; + int ret; + + /* + * Cap the bounce buffer at 64KB. Using a bigger bounce buffer + * has diminishing returns, this is probably because SD/MMC + * cards are usually optimized to handle this size of requests. + */ + bounce_size = SZ_64K; + /* + * Adjust downwards to maximum request size if this is less + * than our segment size, else hammer down the maximum + * request size to the maximum buffer size. + */ + if (mmc->max_req_size < bounce_size) + bounce_size = mmc->max_req_size; + max_blocks = bounce_size / 512; + + /* + * When we just support one segment, we can get significant + * speedups by the help of a bounce buffer to group scattered + * reads/writes together. + */ + host->bounce_buffer = devm_kmalloc(mmc->parent, + bounce_size, + GFP_KERNEL); + if (!host->bounce_buffer) { + pr_err("%s: failed to allocate %u bytes for bounce buffer, falling back to single segments\n", + mmc_hostname(mmc), + bounce_size); + /* + * Exiting with zero here makes sure we proceed with + * mmc->max_segs == 1. + */ + return 0; + } + + host->bounce_addr = dma_map_single(mmc->parent, + host->bounce_buffer, + bounce_size, + DMA_BIDIRECTIONAL); + ret = dma_mapping_error(mmc->parent, host->bounce_addr); + if (ret) + /* Again fall back to max_segs == 1 */ + return 0; + host->bounce_buffer_size = bounce_size; + + /* Lie about this since we're bouncing */ + mmc->max_segs = max_blocks; + mmc->max_seg_size = bounce_size; + mmc->max_req_size = bounce_size; + + pr_info("%s bounce up to %u segments into one, max segment size %u bytes\n", + mmc_hostname(mmc), max_blocks, bounce_size); + + return 0; +} + int sdhci_setup_host(struct sdhci_host *host) { struct mmc_host *mmc; @@ -3713,6 +3852,13 @@ int sdhci_setup_host(struct sdhci_host *host) */ mmc->max_blk_count = (host->quirks & SDHCI_QUIRK_NO_MULTIBLOCK) ? 1 : 65535; + if (mmc->max_segs == 1) { + /* This may alter mmc->*_blk_* parameters */ + ret = sdhci_allocate_bounce_buffer(host); + if (ret) + return ret; + } + return 0; unreg: diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h index 54bc444c317f..1d7d61e25dbf 100644 --- a/drivers/mmc/host/sdhci.h +++ b/drivers/mmc/host/sdhci.h @@ -440,6 +440,9 @@ struct sdhci_host { int irq; /* Device IRQ */ void __iomem *ioaddr; /* Mapped address */ + char *bounce_buffer; /* For packing SDMA reads/writes */ + dma_addr_t bounce_addr; + unsigned int bounce_buffer_size; const struct sdhci_ops *ops; /* Low level hw interface */
The bounce buffer is gone from the MMC core, and now we found out that there are some (crippled) i.MX boards out there that have broken ADMA (cannot do scatter-gather), and also broken PIO so they must use SDMA. Closer examination shows a less significant slowdown also on SDMA-only capable Laptop hosts. SDMA sets down the number of segments to one, so that each segment gets turned into a singular request that ping-pongs to the block layer before the next request/segment is issued. Apparently it happens a lot that the block layer send requests that include a lot of physically discontigous segments. My guess is that this phenomenon is coming from the file system. These devices that cannot handle scatterlists in hardware can see major benefits from a DMA-contigous bounce buffer. This patch accumulates those fragmented scatterlists in a physically contigous bounce buffer so that we can issue bigger DMA data chunks to/from the card. When tested with thise PCI-integrated host (1217:8221) that only supports SDMA: 0b:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS SD/MMC Card Reader Controller (rev 05) This patch gave ~1Mbyte/s improved throughput on large reads and writes when testing using iozone than without the patch. dmesg: sdhci-pci 0000:0b:00.0: SDHCI controller found [1217:8221] (rev 5) mmc0 bounce up to 128 segments into one, max segment size 65536 bytes mmc0: SDHCI controller on PCI [0000:0b:00.0] using DMA On the i.MX SDHCI controllers on the crippled i.MX 25 and i.MX 35 the patch restores the performance to what it was before we removed the bounce buffers, and then some: performance is better than ever because we now allocate a bounce buffer the size of the maximum single request the SDMA engine can handle. On the PCI laptop this is 256K, whereas with the old bounce buffer code it was 64K max. Cc: Benjamin Beckmeyer <beckmeyer.b@rittal.de> Cc: Pierre Ossman <pierre@ossman.eu> Cc: Benoît Thébaudeau <benoit@wsystem.com> Cc: Fabio Estevam <fabio.estevam@nxp.com> Cc: stable@vger.kernel.org Fixes: de3ee99b097d ("mmc: Delete bounce buffer handling") Signed-off-by: Linus Walleij <linus.walleij@linaro.org> --- ChangeLog v5->v6: - Again switch back to explicit sync of buffers. I want to get this solution to work because it gives more control and it's more elegant. - Update host->max_req_size as noted by Adrian, hopefully this fixes the i.MX. I was just lucky on my Intel laptop I guess: the block stack never requested anything bigger than 64KB and that was why it worked even if max_req_size was bigger than what would fit in the bounce buffer. - Copy the number of bytes in the mmc_data instead of the number of bytes in the bounce buffer. For RX this is blksize * blocks and for TX this is bytes_xfered. - Break out a sdhci_sdma_address() for getting the DMA address for either the raw sglist or the bounce buffer depending on configuration. - Add some explicit bounds check for the data so that we do not attempt to copy more than the bounce buffer size even if the block layer is erroneously configured. - Move allocation of bounce buffer out to its own function. - Use pr_[info|err] throughout so all debug prints from the driver come out in the same manner and style. - Use unsigned int for the bounce buffer size. - Re-tested with iozone: we still get the same nice performance improvements. - Request a text on i.MX (hi Benjamin) ChangeLog v4->v5: - Go back to dma_alloc_coherent() as this apparently works better. - Keep the other changes, cap for 64KB, fall back to single segments. - Requesting a test of this on i.MX. (Sorry Benjamin.) ChangeLog v3->v4: - Cap the bounce buffer to 64KB instead of the biggest segment as we experience diminishing returns with buffers > 64KB. - Instead of using dma_alloc_coherent(), use good old devm_kmalloc() and issue dma_sync_single_for*() to explicitly switch ownership between CPU and the device. This way we exercise the cache better and may consume less CPU. - Bail out with single segments if we cannot allocate a bounce buffer. - Tested on the PCI SDHCI on my laptop: requesting a new test on i.MX from Benjamin. (Please!) ChangeLog v2->v3: - Rewrite the commit message a bit - Add Benjamin's Tested-by - Add Fixes and stable tags ChangeLog v1->v2: - Skip the remapping and fiddling with the buffer, instead use dma_alloc_coherent() and use a simple, coherent bounce buffer. - Couple kernel messages to ->parent of the mmc_host as it relates to the hardware characteristics. --- drivers/mmc/host/sdhci.c | 162 ++++++++++++++++++++++++++++++++++++++++++++--- drivers/mmc/host/sdhci.h | 3 + 2 files changed, 157 insertions(+), 8 deletions(-) -- 2.14.3