diff mbox series

[RFC,01/16] Documentation: driver: add SoundWire BRA description

Message ID 20231207222944.663893-2-pierre-louis.bossart@linux.intel.com
State New
Headers show
Series soundwire/ASoC: speed-up downloads with BTP/BRA protocol | expand

Commit Message

Pierre-Louis Bossart Dec. 7, 2023, 10:29 p.m. UTC
The Bulk Register Access protocol was left as a TODO topic since
2018. It's time to document this protocol and the design of its Linux
support.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
---
 Documentation/driver-api/soundwire/bra.rst    | 478 ++++++++++++++++++
 Documentation/driver-api/soundwire/index.rst  |   1 +
 .../driver-api/soundwire/summary.rst          |   5 +-
 3 files changed, 480 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/driver-api/soundwire/bra.rst

Comments

Mark Brown Dec. 7, 2023, 11:29 p.m. UTC | #1
On Thu, Dec 07, 2023 at 04:29:29PM -0600, Pierre-Louis Bossart wrote:

> +Addressing restrictions
> +-----------------------
> +
> +The Device Number specified in the Header follows the SoundWire
> +definitions, and broadcast and group addressing are
> +permitted. However, in reality it is very unlikely that the exact same
> +binary data needs to be provided to the two different Peripheral
> +devices. The Linux implementation only allows for transfers to a
> +single device at a time.

For the firmware download case it seems likely that this won't always be
the case, but it's definitely a thing that could be done incrementally.

> +Regmap use
> +~~~~~~~~~~

> +Existing codec drivers rely on regmap to download firmware to
> +Peripherals, so at a high-level it would seem natural to combine BRA
> +and regmap. The regmap layer could check if BRA is available or not,
> +and use a regular read-write command channel in the latter case.

> +However, the regmap layer does not have information on latency or how
> +critical a BRA transfer is. It would make more sense to let the codec
> +driver make decisions (wait, timeout, fallback to regular
> +read/writes).

These don't seem like insurmountable obstacles - they feel more like a
copy break kind of situation where we can infer things from the pattern
of transactions, and there's always the possibility of adding some calls
to give regmap more high level information about the overall state of
the device.  One of the expected usage patterns with cache only mode is
to build up a final register state then let the cache code work out the
optimal pattern to actually write that out.

> +In addition, the hardware may lose context and ask for the firmware to
> +be downloaded again. The firmware is not however a fixed definition,
> +the SDCA definition allows the hardware to request an updated firmware
> +or a different coefficient table to deal with specific environment
> +conditions. In other words, traditional regmap cache management is not
> +good enough, the Peripheral driver is required track hardware
> +notifications and react accordingly.

I might be missing something but those requests for redownload sound
like things that would occur regardless of the mechanism used to perform
the I/O?

> +Concurrency between BRA and regular read/write
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> +The existing 'nread/nwrite' API already relies on a notion of start
> +address and number of bytes, so it would be possible to extend this
> +API with a 'hint' requesting BPT/BRA be used.

> +However BRA transfers could be quite long, and the use of a single
> +mutex for regular read/write and BRA is a show-stopper. Independent
> +operation of the control/command and BRA transfers is a fundamental
> +requirement, e.g. to change the volume level with the existing regmap
> +interface while downloading firmware.

> +This places the burden on the codec driver to verify that there is no
> +concurrent access to the same address with the command/control
> +protocol and the BRA protocol.

This makes me feel very nervous about robustness.  I do note that regmap
has an async interface which is currently only used for SPI and really
only by the Cirrus DSPs (plus opportunisticly by rbtree cache sync), the
original usage was to fill the pipleine of SPI controllers so we can
ideally push data entirely from interrupt.  TBH that was done before SMP
became standard and it's a lot less clear in this day and age that this
is actually helpful, the overhead of cross core synchronisation likely
obliterates any benefit.  There's sufficently few users that we could
make API changes readily to fit better.

> +One possible strategy to speed-up all initialization tasks would be to
> +start a BRA transfer for firmware download, then deal with all the
> +"regular" read/writes in parallel with the command channel, and last
> +to wait for the BRA transfers to complete. This would allow for a
> +degree of overlap instead of a purely sequential solution. As a
> +results, the BRA API must support async transfers and expose a
> +separate wait function.

This feels a lot like it could map onto the regmap async interface, it
would need a bit of a rework (mainly that currently they provide
ordering guarantees with respect to both each other and sync I/O) but
those could be removed with some care) but it's got the "here's a list
of I/O, here's another call to ensure it's all actually happened" thing.
It sounds very much like how I was thinking of the async API when I was
writing it and the initial users.

It's this bit that really got me thinking it could fit well into regmap.

> +Error handling
> +~~~~~~~~~~~~~~
> +
> +The expected response to a 'bra_message' and follow-up behavior may
> +vary:
> +
> +   (1) A Peripheral driver may want to receive an immediate -EBUSY
> +       response if the BRA protocol is not available at a given time.
> +
> +   (2) A Peripheral driver may want to wait until a timeout for the
> +       on-going transfer to be handled
> +
> +   (3) A Peripheral driver may want to wait until existing BRA
> +       transfers complete or deal with BRA as a background task when
> +       audio transfers stop. In this case, there would be no timeout,
> +       and the operation may not happen if the platform is suspended.

Option 3 would be a jump for regmap.
Pierre-Louis Bossart Dec. 8, 2023, 12:56 a.m. UTC | #2
Thanks for the comments Mark, much appreciated.

>> +Addressing restrictions
>> +-----------------------
>> +
>> +The Device Number specified in the Header follows the SoundWire
>> +definitions, and broadcast and group addressing are
>> +permitted. However, in reality it is very unlikely that the exact same
>> +binary data needs to be provided to the two different Peripheral
>> +devices. The Linux implementation only allows for transfers to a
>> +single device at a time.
> 
> For the firmware download case it seems likely that this won't always be
> the case, but it's definitely a thing that could be done incrementally.

One example discussed this week on the mailing list is the Cirrus Logic
case, where the firmware contains information on which speakers a given
amplifier should be doing, and each firmware is named as AMP-n.

I have really not found any practical case where the same data needs to
be sent to N devices, and I don't have a burning desire to tie the codec
initialization together with all the asynchronous nature of
enumeration/probe.

>> +Regmap use
>> +~~~~~~~~~~
> 
>> +Existing codec drivers rely on regmap to download firmware to
>> +Peripherals, so at a high-level it would seem natural to combine BRA
>> +and regmap. The regmap layer could check if BRA is available or not,
>> +and use a regular read-write command channel in the latter case.
> 
>> +However, the regmap layer does not have information on latency or how
>> +critical a BRA transfer is. It would make more sense to let the codec
>> +driver make decisions (wait, timeout, fallback to regular
>> +read/writes).
> 
> These don't seem like insurmountable obstacles - they feel more like a
> copy break kind of situation where we can infer things from the pattern
> of transactions, and there's always the possibility of adding some calls
> to give regmap more high level information about the overall state of
> the device.  One of the expected usage patterns with cache only mode is
> to build up a final register state then let the cache code work out the
> optimal pattern to actually write that out.

I did expect some pushback on regmap :-)

The point is really that the main use for this download is a set of
write-once memory areas which happen to be mapped as registers. There is
no real need to declare or access each memory address individually.

In addition in terms of error handling, the expectation is that the set
of writes are handled in a pass/fail manner. There's no real way to know
which of the individual writes failed.

>> +In addition, the hardware may lose context and ask for the firmware to
>> +be downloaded again. The firmware is not however a fixed definition,
>> +the SDCA definition allows the hardware to request an updated firmware
>> +or a different coefficient table to deal with specific environment
>> +conditions. In other words, traditional regmap cache management is not
>> +good enough, the Peripheral driver is required track hardware
>> +notifications and react accordingly.
> 
> I might be missing something but those requests for redownload sound
> like things that would occur regardless of the mechanism used to perform
> the I/O?

What I meant is that the codec driver would e.g. need to fetch a
different firmware table and download it. There's no real need to
maintain a cache on the host side since the entire table will be downloaded.

>> +Concurrency between BRA and regular read/write
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
>> +The existing 'nread/nwrite' API already relies on a notion of start
>> +address and number of bytes, so it would be possible to extend this
>> +API with a 'hint' requesting BPT/BRA be used.
> 
>> +However BRA transfers could be quite long, and the use of a single
>> +mutex for regular read/write and BRA is a show-stopper. Independent
>> +operation of the control/command and BRA transfers is a fundamental
>> +requirement, e.g. to change the volume level with the existing regmap
>> +interface while downloading firmware.
> 
>> +This places the burden on the codec driver to verify that there is no
>> +concurrent access to the same address with the command/control
>> +protocol and the BRA protocol.
> 
> This makes me feel very nervous about robustness.  I do note that regmap
> has an async interface which is currently only used for SPI and really
> only by the Cirrus DSPs (plus opportunisticly by rbtree cache sync), the
> original usage was to fill the pipleine of SPI controllers so we can
> ideally push data entirely from interrupt.  TBH that was done before SMP
> became standard and it's a lot less clear in this day and age that this
> is actually helpful, the overhead of cross core synchronisation likely
> obliterates any benefit.  There's sufficently few users that we could
> make API changes readily to fit better.

I am not that nervous, it's a known hardware issue and the software
drivers SHALL make sure that the two methods of accessing a register are
not used concurrently. We could add some sort of mutex on specific
memory areas.

>> +One possible strategy to speed-up all initialization tasks would be to
>> +start a BRA transfer for firmware download, then deal with all the
>> +"regular" read/writes in parallel with the command channel, and last
>> +to wait for the BRA transfers to complete. This would allow for a
>> +degree of overlap instead of a purely sequential solution. As a
>> +results, the BRA API must support async transfers and expose a
>> +separate wait function.
> 
> This feels a lot like it could map onto the regmap async interface, it
> would need a bit of a rework (mainly that currently they provide
> ordering guarantees with respect to both each other and sync I/O) but
> those could be removed with some care) but it's got the "here's a list
> of I/O, here's another call to ensure it's all actually happened" thing.
> It sounds very much like how I was thinking of the async API when I was
> writing it and the initial users.
> 
> It's this bit that really got me thinking it could fit well into regmap.

I really don't know anything about this async interface, if you have
pointers on existing examples I am all ears....I am not aware of any
Intel platform or codec used on an Intel platform making use of this API.

At any rate the low-level behavior is to
a) allocate and configure all the SoundWire resources
b) allocate and configure all the DMA resources
c) trigger DMA and enable SoundWire transfers
d) wait for the DMA to complete

The codec API can be modified as needed, as long as there are provisions
for these 4 steps.

>> +Error handling
>> +~~~~~~~~~~~~~~
>> +
>> +The expected response to a 'bra_message' and follow-up behavior may
>> +vary:
>> +
>> +   (1) A Peripheral driver may want to receive an immediate -EBUSY
>> +       response if the BRA protocol is not available at a given time.
>> +
>> +   (2) A Peripheral driver may want to wait until a timeout for the
>> +       on-going transfer to be handled
>> +
>> +   (3) A Peripheral driver may want to wait until existing BRA
>> +       transfers complete or deal with BRA as a background task when
>> +       audio transfers stop. In this case, there would be no timeout,
>> +       and the operation may not happen if the platform is suspended.
> 
> Option 3 would be a jump for regmap.

Sorry, I don't get what a 'jump' means in this context.
Charles Keepax Dec. 8, 2023, 4:27 p.m. UTC | #3
On Thu, Dec 07, 2023 at 04:29:29PM -0600, Pierre-Louis Bossart wrote:
> The Bulk Register Access protocol was left as a TODO topic since
> 2018. It's time to document this protocol and the design of its Linux
> support.
> 
> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> ---
> +Concurrency between BRA and regular read/write
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The existing 'nread/nwrite' API already relies on a notion of start
> +address and number of bytes, so it would be possible to extend this
> +API with a 'hint' requesting BPT/BRA be used.
> +
> +However BRA transfers could be quite long, and the use of a single
> +mutex for regular read/write and BRA is a show-stopper. Independent
> +operation of the control/command and BRA transfers is a fundamental
> +requirement, e.g. to change the volume level with the existing regmap
> +interface while downloading firmware.

Is this definitely a show stopper? Not saying that it wouldn't be
desirable to do both from a speed perspective, but current
systems that download firmware (I2C/SPI) typically will block the
bus for some amount of time. There are also some desirable properties
to a single lock such as not needing to worry about accessing the
same register in the bulk transfer and a normal command transfer.

> +Audio DMA support
> +-----------------
> +
> +Some DMAs, such as HDaudio, require an audio format field to be
> +set. This format is in turn used to define acceptable bursts. BPT/BRA
> +support is not fully compatible with these definitions in that the
> +format may vary between read and write commands.
> +
> +In addition, on Intel HDaudio Intel platforms the DMAs need to be
> +programmed with a PCM format matching the bandwidth of the BPT/BRA
> +transfer. The format is based on 48kHz 32-bit samples, and the number
> +of channels varies to adjust the bandwidth. The notion of channel is
> +completely notional since the data is not typical audio
> +PCM. Programming channels helps reserve enough bandwidth and adjust
> +FIFO sizes to avoid xruns. Note that the quality of service comes as a
> +cost. Since all channels need to be present as a sample block, data
> +sizes not aligned to 128-bytes are not supported.

Apologies but could you elaborate a litte on this? I am not sure
I follow the reasoning, how does the 48k 32bit DMA implementation
result in 128-byte limitation? I would have thought 1 channel would
be 4-bytes and you are varying the channels so I would have expected
4-byte aligned maybe 8-byte if the DMA expects stereo pairs.

And what exactly do we mean by aligned, are we saying the length
all transfers needs to be a multiple of 128-bytes?

I think we might have some annoying restrictions on the block
size on our hardware as well I will go dig into that and report
back.

Thanks,
Charles
Mark Brown Dec. 8, 2023, 9:49 p.m. UTC | #4
On Thu, Dec 07, 2023 at 06:56:55PM -0600, Pierre-Louis Bossart wrote:

> >> +The Device Number specified in the Header follows the SoundWire
> >> +definitions, and broadcast and group addressing are
> >> +permitted. However, in reality it is very unlikely that the exact same
> >> +binary data needs to be provided to the two different Peripheral
> >> +devices. The Linux implementation only allows for transfers to a
> >> +single device at a time.

> > For the firmware download case it seems likely that this won't always be
> > the case, but it's definitely a thing that could be done incrementally.

> One example discussed this week on the mailing list is the Cirrus Logic
> case, where the firmware contains information on which speakers a given
> amplifier should be doing, and each firmware is named as AMP-n.

I can imagine a vendor structuring things so they've got separate
firmware and coefficent/configuration images, the firmware images could
be shared.

> I have really not found any practical case where the same data needs to
> be sent to N devices, and I don't have a burning desire to tie the codec
> initialization together with all the asynchronous nature of
> enumeration/probe.

Like I say it does seem like something that could be done incrementally.

> > These don't seem like insurmountable obstacles - they feel more like a
> > copy break kind of situation where we can infer things from the pattern
> > of transactions, and there's always the possibility of adding some calls
> > to give regmap more high level information about the overall state of
> > the device.  One of the expected usage patterns with cache only mode is
> > to build up a final register state then let the cache code work out the
> > optimal pattern to actually write that out.

> I did expect some pushback on regmap :-)

> The point is really that the main use for this download is a set of
> write-once memory areas which happen to be mapped as registers. There is
> no real need to declare or access each memory address individually.

Yeah, normally it's just write only stuff but I've seen things like
controls being added in the DSP memory before - the 

> In addition in terms of error handling, the expectation is that the set
> of writes are handled in a pass/fail manner. There's no real way to know
> which of the individual writes failed.

That's the case for any block writes.

> > I might be missing something but those requests for redownload sound
> > like things that would occur regardless of the mechanism used to perform
> > the I/O?

> What I meant is that the codec driver would e.g. need to fetch a
> different firmware table and download it. There's no real need to
> maintain a cache on the host side since the entire table will be downloaded.

I mean, if that's the usage pattern surely things would just be marked
volatile anyway?  A cache is entirely optional.

> > This feels a lot like it could map onto the regmap async interface, it
> > would need a bit of a rework (mainly that currently they provide
> > ordering guarantees with respect to both each other and sync I/O) but
> > those could be removed with some care) but it's got the "here's a list
> > of I/O, here's another call to ensure it's all actually happened" thing.
> > It sounds very much like how I was thinking of the async API when I was
> > writing it and the initial users.

> > It's this bit that really got me thinking it could fit well into regmap.

> I really don't know anything about this async interface, if you have
> pointers on existing examples I am all ears....I am not aware of any
> Intel platform or codec used on an Intel platform making use of this API.

grep for regmap_.*async - cs_dsp.c is the upstream example in a driver,
or there's the rbtree cache sync code which uses a back door to go into
an async mode.  Basically just variants of all the normal regmap I/O
calls with a _complete() call you can use to wait for everything to
happen.  The implementation is a bit heavyweight since it was written to
work with fairly slow buses.

> At any rate the low-level behavior is to
> a) allocate and configure all the SoundWire resources
> b) allocate and configure all the DMA resources
> c) trigger DMA and enable SoundWire transfers
> d) wait for the DMA to complete

> The codec API can be modified as needed, as long as there are provisions
> for these 4 steps.

That's not quite how the current API works, but it feels like it's close
enough to the intent and there's literally one user of the actual API.

> >> +   (3) A Peripheral driver may want to wait until existing BRA
> >> +       transfers complete or deal with BRA as a background task when
> >> +       audio transfers stop. In this case, there would be no timeout,
> >> +       and the operation may not happen if the platform is suspended.

> > Option 3 would be a jump for regmap.

> Sorry, I don't get what a 'jump' means in this context.

Big change.
Vinod Koul Dec. 18, 2023, 11:40 a.m. UTC | #5
Hi Pierre,

On 07-12-23, 16:29, Pierre-Louis Bossart wrote:
> The Bulk Register Access protocol was left as a TODO topic since
> 2018. It's time to document this protocol and the design of its Linux
> support.

Thanks for letting me know about this thread. At least now with B4 we
can grab threads we are not copied.

> 
> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> ---
>  Documentation/driver-api/soundwire/bra.rst    | 478 ++++++++++++++++++

Can we split the cadence parts of this to bra-cadence.rst that way this
file documents the core parts only

> +Concurrency between BRA and regular read/write
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The existing 'nread/nwrite' API already relies on a notion of start
> +address and number of bytes, so it would be possible to extend this
> +API with a 'hint' requesting BPT/BRA be used.
> +
> +However BRA transfers could be quite long, and the use of a single
> +mutex for regular read/write and BRA is a show-stopper. Independent
> +operation of the control/command and BRA transfers is a fundamental
> +requirement, e.g. to change the volume level with the existing regmap
> +interface while downloading firmware.
> +
> +This places the burden on the codec driver to verify that there is no
> +concurrent access to the same address with the command/control
> +protocol and the BRA protocol.
> +
> +In addition, the 'sdw_msg' structure hard-codes support for 16-bit
> +addresses and paging registers which are irrelevant for BPT/BRA
> +support based on native 32-bit addresses. A separate API with
> +'sdw_bpt_msg' makes more sense.
> +
> +One possible strategy to speed-up all initialization tasks would be to
> +start a BRA transfer for firmware download, then deal with all the
> +"regular" read/writes in parallel with the command channel, and last
> +to wait for the BRA transfers to complete. This would allow for a
> +degree of overlap instead of a purely sequential solution. As a
> +results, the BRA API must support async transfers and expose a
> +separate wait function.

That sounds logical to me

> +
> +Error handling
> +~~~~~~~~~~~~~~
> +
> +The expected response to a 'bra_message' and follow-up behavior may
> +vary:
> +
> +   (1) A Peripheral driver may want to receive an immediate -EBUSY
> +       response if the BRA protocol is not available at a given time.
> +
> +   (2) A Peripheral driver may want to wait until a timeout for the
> +       on-going transfer to be handled
> +
> +   (3) A Peripheral driver may want to wait until existing BRA
> +       transfers complete or deal with BRA as a background task when
> +       audio transfers stop. In this case, there would be no timeout,
> +       and the operation may not happen if the platform is suspended.

Is this runtime suspend or S3/S4 case?

> +BTP/BRA API available to peripheral drivers
> +-------------------------------------------
> +
> +ASoC Peripheral drivers may use
> +
> +   - sdw_bpt_stream_open(mode)
> +
> +      This function verifies that the BPT protocol with the
> +      'mode'. For now only BRA is accepted as a mode. This function
> +      allocates a work buffer internally. This buffer is not exposed
> +      to the caller.
> +
> +     errors:
> +         -ENODEV: BPT/BRA is not supported by the Manager.
> +
> +         -EBUSY: another agent is already using the audio payload for
> +          audio transfers. There is no way to predict when the audio
> +          streams might stop, this will require the Peripheral driver
> +          to fall back to the regular (slow) command channel.
> +
> +         -EAGAIN: another agent is already transferring data using the
> +          BPT/BRA protocol. Since the transfers will typically last
> +          10s or 100s of ms, the Peripheral driver may wait and retry
> +          later.
> +
> +    - sdw_bpt_message_send_async(bpt_message)

why not have a single API that does both? First check if it is supported
and then allocate buffers and do the transfer.. What are the advantages
of using this two step process
Pierre-Louis Bossart Dec. 18, 2023, 12:58 p.m. UTC | #6
>>  Documentation/driver-api/soundwire/bra.rst    | 478 ++++++++++++++++++
> 
> Can we split the cadence parts of this to bra-cadence.rst that way this
> file documents the core parts only

Yes, we can split the Cadence parts out.


>> +Error handling
>> +~~~~~~~~~~~~~~
>> +
>> +The expected response to a 'bra_message' and follow-up behavior may
>> +vary:
>> +
>> +   (1) A Peripheral driver may want to receive an immediate -EBUSY
>> +       response if the BRA protocol is not available at a given time.
>> +
>> +   (2) A Peripheral driver may want to wait until a timeout for the
>> +       on-going transfer to be handled
>> +
>> +   (3) A Peripheral driver may want to wait until existing BRA
>> +       transfers complete or deal with BRA as a background task when
>> +       audio transfers stop. In this case, there would be no timeout,
>> +       and the operation may not happen if the platform is suspended.
> 
> Is this runtime suspend or S3/S4 case?

System suspend (which can also mean S0i1).

I don't think we can have a case where a peripheral driver waits on
something without having done a pm_runtime_get_sync() to prevent
runtime_pm suspend.

> 
>> +BTP/BRA API available to peripheral drivers
>> +-------------------------------------------
>> +
>> +ASoC Peripheral drivers may use
>> +
>> +   - sdw_bpt_stream_open(mode)
>> +
>> +      This function verifies that the BPT protocol with the
>> +      'mode'. For now only BRA is accepted as a mode. This function
>> +      allocates a work buffer internally. This buffer is not exposed
>> +      to the caller.
>> +
>> +     errors:
>> +         -ENODEV: BPT/BRA is not supported by the Manager.
>> +
>> +         -EBUSY: another agent is already using the audio payload for
>> +          audio transfers. There is no way to predict when the audio
>> +          streams might stop, this will require the Peripheral driver
>> +          to fall back to the regular (slow) command channel.
>> +
>> +         -EAGAIN: another agent is already transferring data using the
>> +          BPT/BRA protocol. Since the transfers will typically last
>> +          10s or 100s of ms, the Peripheral driver may wait and retry
>> +          later.
>> +
>> +    - sdw_bpt_message_send_async(bpt_message)
> 
> why not have a single API that does both? First check if it is supported
> and then allocate buffers and do the transfer.. What are the advantages
> of using this two step process

Symmetry is the only thing that comes to my mind. Open - close and send
- wait are natural matches, aren't they?

We do need a wait(), so bundling open() and send() would be odd.

But you have a point that the open() is not generic in that it also
prepares the DMA buffers for transmission. Maybe it's more natural to
follow the traditional open(), hw_params(), hw_free, close() from ALSA.
Charles Keepax Dec. 18, 2023, 2:29 p.m. UTC | #7
On Mon, Dec 18, 2023 at 01:58:47PM +0100, Pierre-Louis Bossart wrote:
> > why not have a single API that does both? First check if it is supported
> > and then allocate buffers and do the transfer.. What are the advantages
> > of using this two step process
> 
> Symmetry is the only thing that comes to my mind. Open - close and send
> - wait are natural matches, aren't they?
> 
> We do need a wait(), so bundling open() and send() would be odd.
> 

I agree send->wait->close would be odd, But you just bundle close
into wait. So the API becomes just send->wait, which seems pretty
logical.

> But you have a point that the open() is not generic in that it also
> prepares the DMA buffers for transmission. Maybe it's more natural to
> follow the traditional open(), hw_params(), hw_free, close() from ALSA.

I think this just makes it worse, you are now adding even more
calls. The problem I see here is that, open and close (at least to
me) strongly implies that you can do multiple operations between
them and unless I have misunderstood something here you can't.

Thanks,
Charles
Pierre-Louis Bossart Dec. 18, 2023, 4:33 p.m. UTC | #8
On 12/18/23 08:29, Charles Keepax wrote:
> On Mon, Dec 18, 2023 at 01:58:47PM +0100, Pierre-Louis Bossart wrote:
>>> why not have a single API that does both? First check if it is supported
>>> and then allocate buffers and do the transfer.. What are the advantages
>>> of using this two step process
>>
>> Symmetry is the only thing that comes to my mind. Open - close and send
>> - wait are natural matches, aren't they?
>>
>> We do need a wait(), so bundling open() and send() would be odd.
>>
> 
> I agree send->wait->close would be odd, But you just bundle close
> into wait. So the API becomes just send->wait, which seems pretty
> logical.

Fair enough, send()/wait() would work indeed.

I guess I wanted to keep the callbacks reasonably small (already 200
lines for the open), but we can split the 'send' callback into smaller
helpers to keep the code readable. There's no good reason to expose
these smaller helpers to codec drivers.

>> But you have a point that the open() is not generic in that it also
>> prepares the DMA buffers for transmission. Maybe it's more natural to
>> follow the traditional open(), hw_params(), hw_free, close() from ALSA.
> 
> I think this just makes it worse, you are now adding even more
> calls. The problem I see here is that, open and close (at least to
> me) strongly implies that you can do multiple operations between
> them and unless I have misunderstood something here you can't.

That's right, the open was not compatible with multiple operations.
Collapsing open/send and wait/close sounds more logical, thanks for the
feedback.
Vinod Koul Dec. 21, 2023, 2:44 p.m. UTC | #9
On 18-12-23, 13:58, Pierre-Louis Bossart wrote:
> 
> >>  Documentation/driver-api/soundwire/bra.rst    | 478 ++++++++++++++++++
> > 
> > Can we split the cadence parts of this to bra-cadence.rst that way this
> > file documents the core parts only
> 
> Yes, we can split the Cadence parts out.

Great

> 
> 
> >> +Error handling
> >> +~~~~~~~~~~~~~~
> >> +
> >> +The expected response to a 'bra_message' and follow-up behavior may
> >> +vary:
> >> +
> >> +   (1) A Peripheral driver may want to receive an immediate -EBUSY
> >> +       response if the BRA protocol is not available at a given time.
> >> +
> >> +   (2) A Peripheral driver may want to wait until a timeout for the
> >> +       on-going transfer to be handled
> >> +
> >> +   (3) A Peripheral driver may want to wait until existing BRA
> >> +       transfers complete or deal with BRA as a background task when
> >> +       audio transfers stop. In this case, there would be no timeout,
> >> +       and the operation may not happen if the platform is suspended.
> > 
> > Is this runtime suspend or S3/S4 case?
> 
> System suspend (which can also mean S0i1).
> 
> I don't think we can have a case where a peripheral driver waits on
> something without having done a pm_runtime_get_sync() to prevent
> runtime_pm suspend.
> 
> > 
> >> +BTP/BRA API available to peripheral drivers
> >> +-------------------------------------------
> >> +
> >> +ASoC Peripheral drivers may use
> >> +
> >> +   - sdw_bpt_stream_open(mode)
> >> +
> >> +      This function verifies that the BPT protocol with the
> >> +      'mode'. For now only BRA is accepted as a mode. This function
> >> +      allocates a work buffer internally. This buffer is not exposed
> >> +      to the caller.
> >> +
> >> +     errors:
> >> +         -ENODEV: BPT/BRA is not supported by the Manager.
> >> +
> >> +         -EBUSY: another agent is already using the audio payload for
> >> +          audio transfers. There is no way to predict when the audio
> >> +          streams might stop, this will require the Peripheral driver
> >> +          to fall back to the regular (slow) command channel.
> >> +
> >> +         -EAGAIN: another agent is already transferring data using the
> >> +          BPT/BRA protocol. Since the transfers will typically last
> >> +          10s or 100s of ms, the Peripheral driver may wait and retry
> >> +          later.
> >> +
> >> +    - sdw_bpt_message_send_async(bpt_message)
> > 
> > why not have a single API that does both? First check if it is supported
> > and then allocate buffers and do the transfer.. What are the advantages
> > of using this two step process
> 
> Symmetry is the only thing that comes to my mind. Open - close and send
> - wait are natural matches, aren't they?

Why have symmetry to DAI apis, why not symmetry to regmap write APIs..?
This is data transfer, so I am not sure why would we model it as a DAI.
(Internal implementation may rely on that but from API design, i dont
think that should be a concern)

> 
> We do need a wait(), so bundling open() and send() would be odd.
> 
> But you have a point that the open() is not generic in that it also
> prepares the DMA buffers for transmission. Maybe it's more natural to
> follow the traditional open(), hw_params(), hw_free, close() from ALSA.
Vinod Koul Dec. 21, 2023, 2:45 p.m. UTC | #10
On 18-12-23, 17:33, Pierre-Louis Bossart wrote:
> 
> 
> On 12/18/23 08:29, Charles Keepax wrote:
> > On Mon, Dec 18, 2023 at 01:58:47PM +0100, Pierre-Louis Bossart wrote:
> >>> why not have a single API that does both? First check if it is supported
> >>> and then allocate buffers and do the transfer.. What are the advantages
> >>> of using this two step process
> >>
> >> Symmetry is the only thing that comes to my mind. Open - close and send
> >> - wait are natural matches, aren't they?
> >>
> >> We do need a wait(), so bundling open() and send() would be odd.
> >>
> > 
> > I agree send->wait->close would be odd, But you just bundle close
> > into wait. So the API becomes just send->wait, which seems pretty
> > logical.
> 
> Fair enough, send()/wait() would work indeed.
> 
> I guess I wanted to keep the callbacks reasonably small (already 200
> lines for the open), but we can split the 'send' callback into smaller
> helpers to keep the code readable. There's no good reason to expose
> these smaller helpers to codec drivers.

Yes! that would be a better design IMO

> 
> >> But you have a point that the open() is not generic in that it also
> >> prepares the DMA buffers for transmission. Maybe it's more natural to
> >> follow the traditional open(), hw_params(), hw_free, close() from ALSA.
> > 
> > I think this just makes it worse, you are now adding even more
> > calls. The problem I see here is that, open and close (at least to
> > me) strongly implies that you can do multiple operations between
> > them and unless I have misunderstood something here you can't.
> 
> That's right, the open was not compatible with multiple operations.
> Collapsing open/send and wait/close sounds more logical, thanks for the
> feedback.

Sure
Pierre-Louis Bossart Aug. 20, 2024, 7:48 a.m. UTC | #11
Hi Mark,

>> +One possible strategy to speed-up all initialization tasks would be to
>> +start a BRA transfer for firmware download, then deal with all the
>> +"regular" read/writes in parallel with the command channel, and last
>> +to wait for the BRA transfers to complete. This would allow for a
>> +degree of overlap instead of a purely sequential solution. As a
>> +results, the BRA API must support async transfers and expose a
>> +separate wait function.
> 
> This feels a lot like it could map onto the regmap async interface, it
> would need a bit of a rework (mainly that currently they provide
> ordering guarantees with respect to both each other and sync I/O) but
> those could be removed with some care) but it's got the "here's a list
> of I/O, here's another call to ensure it's all actually happened" thing.
> It sounds very much like how I was thinking of the async API when I was
> writing it and the initial users.
> 
> It's this bit that really got me thinking it could fit well into regmap.

I am now revisiting this entire patchset to try to enable firmware
downloads with this SoundWire BPT/BRA protocol. I re-looked at
regmap_raw_write_async()/regmap_async_complete() and they seem to map
well with my initial write/wait_async proposal.

Do I get this right that these async capabilities could be used to
enable this faster SoundWire protocol, but the regular regmap_write()
could still happen in parallel with these async transfers? This could be
useful if e.g. there's a jack detection while downloading firmware for
another function.

I don't see any locking that would prevent such a dual operation - with
the caveat that it would be up to the codec driver to make sure they
don't try to access the same register with the two modes of operation.

The only thing that would need to be extended is that we'd need
additional callbacks to check if the protocol is supported on a given
hardware/firmware platform, and if there's no audio stream. In these
cases the async writes would be demoted to regular writes. Otherwise the
bus would be locked to make sure no reconfiguration takes place while
the firmware download happens, and unlocked when the transfer ends.

Thanks for your help on all this,
-Pierre
Mark Brown Aug. 20, 2024, 11:53 a.m. UTC | #12
On Tue, Aug 20, 2024 at 09:48:05AM +0200, Pierre-Louis Bossart wrote:

> > This feels a lot like it could map onto the regmap async interface, it
> > would need a bit of a rework (mainly that currently they provide
> > ordering guarantees with respect to both each other and sync I/O) but
> > those could be removed with some care) but it's got the "here's a list
> > of I/O, here's another call to ensure it's all actually happened" thing.
> > It sounds very much like how I was thinking of the async API when I was
> > writing it and the initial users.

> > It's this bit that really got me thinking it could fit well into regmap.

> Do I get this right that these async capabilities could be used to
> enable this faster SoundWire protocol, but the regular regmap_write()
> could still happen in parallel with these async transfers? This could be
> useful if e.g. there's a jack detection while downloading firmware for
> another function.

As far as the regmap core is concerned, yes - with SPI we do wind up
with ordering but that's because the SPI sync API is a thin wrapper
around the SPI async API rather than anything regmap does.

> The only thing that would need to be extended is that we'd need
> additional callbacks to check if the protocol is supported on a given
> hardware/firmware platform, and if there's no audio stream. In these
> cases the async writes would be demoted to regular writes. Otherwise the
> bus would be locked to make sure no reconfiguration takes place while
> the firmware download happens, and unlocked when the transfer ends.

Those callbacks seem reasonable.  We already do the demotion to sync for
buses that don't have async, it'd just need to be made dynamic.
Pierre-Louis Bossart Aug. 20, 2024, 2:58 p.m. UTC | #13
On 8/20/24 13:53, Mark Brown wrote:
> On Tue, Aug 20, 2024 at 09:48:05AM +0200, Pierre-Louis Bossart wrote:
> 
>>> This feels a lot like it could map onto the regmap async interface, it
>>> would need a bit of a rework (mainly that currently they provide
>>> ordering guarantees with respect to both each other and sync I/O) but
>>> those could be removed with some care) but it's got the "here's a list
>>> of I/O, here's another call to ensure it's all actually happened" thing.
>>> It sounds very much like how I was thinking of the async API when I was
>>> writing it and the initial users.
> 
>>> It's this bit that really got me thinking it could fit well into regmap.
> 
>> Do I get this right that these async capabilities could be used to
>> enable this faster SoundWire protocol, but the regular regmap_write()
>> could still happen in parallel with these async transfers? This could be
>> useful if e.g. there's a jack detection while downloading firmware for
>> another function.
> 
> As far as the regmap core is concerned, yes - with SPI we do wind up
> with ordering but that's because the SPI sync API is a thin wrapper
> around the SPI async API rather than anything regmap does.

ok. I am struggling a bit to adjust the plumbing that was obviously
defined for SPI.

The regmap async_write() doesn't have any wait, but there's a notifier
that needs to be called by *something* that waits. I think the SPI layer
has a set of kthreads but in our case we could just rely on a a
workqueue after 1ms or something and do the wait then for the DMAs to
complete and finally call the notifier to wake-up the regmap stuff.

>> The only thing that would need to be extended is that we'd need
>> additional callbacks to check if the protocol is supported on a given
>> hardware/firmware platform, and if there's no audio stream. In these
>> cases the async writes would be demoted to regular writes. Otherwise the
>> bus would be locked to make sure no reconfiguration takes place while
>> the firmware download happens, and unlocked when the transfer ends.
> 
> Those callbacks seem reasonable.  We already do the demotion to sync for
> buses that don't have async, it'd just need to be made dynamic.

sounds good.
Mark Brown Aug. 20, 2024, 3:09 p.m. UTC | #14
On Tue, Aug 20, 2024 at 04:58:30PM +0200, Pierre-Louis Bossart wrote:
> On 8/20/24 13:53, Mark Brown wrote:

> > As far as the regmap core is concerned, yes - with SPI we do wind up
> > with ordering but that's because the SPI sync API is a thin wrapper
> > around the SPI async API rather than anything regmap does.

> ok. I am struggling a bit to adjust the plumbing that was obviously
> defined for SPI.

> The regmap async_write() doesn't have any wait, but there's a notifier
> that needs to be called by *something* that waits. I think the SPI layer
> has a set of kthreads but in our case we could just rely on a a
> workqueue after 1ms or something and do the wait then for the DMAs to
> complete and finally call the notifier to wake-up the regmap stuff.

From the regmap point of view the expectation is that when the caller
wants to know that all the I/Os have actually happened it should call
regmap_async_complete() which will do the actual blocking.  regmap
itself won't autonomously wait for anything, it's up to the caller to do
that (presumably it will at some point care).
diff mbox series

Patch

diff --git a/Documentation/driver-api/soundwire/bra.rst b/Documentation/driver-api/soundwire/bra.rst
new file mode 100644
index 000000000000..4cc934bf614d
--- /dev/null
+++ b/Documentation/driver-api/soundwire/bra.rst
@@ -0,0 +1,478 @@ 
+==========================
+Bulk Register Access (BRA)
+==========================
+
+Conventions
+-----------
+
+Capitalized words used in this documentation are intentional and refer
+to concepts of the SoundWire 1.x specification.
+
+Introduction
+------------
+
+The SoundWire 1.x specification provides a mechanism to speed-up
+command/control transfers by reclaiming parts of the audio
+bandwidth. The Bulk Register Access (BRA) protocol is a standard
+solution based on the Bulk Payload Transport (BPT) definitions.
+
+The regular control channel uses Column 0 and can only send/retrieve
+one byte per frame with write/read commands. With a typical 48kHz
+frame rate, only 48kB/s can be transferred.
+
+The optional Bulk Register Access capability can transmit up to 12
+Mbits/s and reduce transfer times by several orders of magnitude, but
+has multiple design constraints:
+
+  (1) Each frame can only support a read or a write transfer, with a
+      10-byte overhead per frame (header and footer response).
+
+  (2) The read/writes SHALL be from/to contiguous register addresses
+      in the same frame. A fragmented register space decreases the
+      efficiency of the protocol by requiring multiple BRA transfers
+      scheduled in different frames.
+
+  (3) The targeted Peripheral device SHALL support the optional Data
+      Port 0, and likewise the Manager SHALL expose audio-like Ports
+      to insert BRA packets in the audio payload using the concepts of
+      Sample Interval, HSTART, HSTOP, etc.
+
+  (4) The BRA transport efficiency depends on the available
+      bandwidth. If there are no on-going audio transfers, the entire
+      frame minus Column 0 can be reclaimed for BRA. The frame shape
+      also impacts efficiency: since Column0 cannot be used for
+      BTP/BRA, the frame should rely on a large number of columns and
+      minimize the number of rows. The bus clock should be as high as
+      possible.
+
+  (5) The number of bits transferred per frame SHALL be a multiple of
+      8 bits. Padding bits SHALL be inserted if necessary at the end
+      of the data.
+
+  (6) The regular read/write commands can be issued in parallel with
+      BRA transfers. This is convenient to e.g. deal with alerts, jack
+      detection or change the volume during firmware download, but
+      accessing the same address with two independent protocols has to
+      be avoided to avoid undefined behavior.
+
+  (7) Some implementations may not be capable of handling the
+      bandwidth of the BRA protocol, e.g. in the case of a slow I2C
+      bus behind the SoundWire IP. In this case, the transfers may
+      need to be spaced in time or flow-controlled.
+
+  (8) Each BRA packet SHALL be marked as 'Active' when valid data is
+      to be transmitted. This allows for software to allocate a BRA
+      stream but not transmit/discard data while processing the
+      results or preparing the next batch of data, or allowing the
+      peripheral to deal with the previous transfer. In addition BRA
+      transfer can be started early on without data being ready.
+
+  (9) Up to 470 bytes may be transmitted per frame.
+
+  (10) The address is represented with 32 bits and does not rely on
+       the paging registers used for the regular command/control
+       protocol in Column 0.
+
+
+Error checking
+--------------
+
+Firmware download is one of the key usages of the Bulk Register Access
+protocol. To make sure the binary data integrity is not compromised by
+transmission or programming errors, each BRA packet provides:
+
+  (1) A CRC on the 7-byte header. This CRC helps the Peripheral Device
+      check if it is addressed and set the start address and number of
+      bytes. The Peripheral Device provides a response in Byte 7.
+
+  (2) A CRC on the data block (header excluded). This CRC is
+      transmitted as the last-but-one byte in the packet, prior to the
+      footer response.
+
+The header response can be one of
+  (a) Ack
+  (b) Nak
+  (c) Not Ready
+
+The footer response can be one of
+  (1) Ack
+  (2) Nak  (CRC failure)
+  (3) Good (operation completed)
+  (4) Bad  (operation failed)
+
+Example frame
+-------------
+
+The example below is not to scale and makes simplifying assumptions
+for clarity. The different chunks in the BRA packets are not required
+to start on a new SoundWire Row, and the scale of data may vary.
+
+      ::
+
+	+---+--------------------------------------------+
+	+   |                                            |
+	+   |             BRA HEADER                     |
+        +   |	                                         |
+	+   +--------------------------------------------+
+	+ C |             HEADER CRC                     |
+	+ O +--------------------------------------------+
+	+ M | 	          HEADER RESPONSE                |
+	+ M +--------------------------------------------+
+	+ A |                                            |
+	+ N |                                            |
+	+ D |                 DATA                       |
+	+   |                                            |
+	+   |                                            |
+	+   |                                            |
+	+   +--------------------------------------------+
+	+   |             DATA CRC                       |
+	+   +--------------------------------------------+
+	+   | 	          FOOTER RESPONSE                |
+	+---+--------------------------------------------+
+
+
+Assuming the frame uses N columns, the configuration shown above can
+be programmed by setting the DP0 registers as:
+
+    - HSTART = 1
+    - HSTOP = N - 1
+    - Sampling Interval = N
+    - WordLength = N - 1
+
+Addressing restrictions
+-----------------------
+
+The Device Number specified in the Header follows the SoundWire
+definitions, and broadcast and group addressing are
+permitted. However, in reality it is very unlikely that the exact same
+binary data needs to be provided to the two different Peripheral
+devices. The Linux implementation only allows for transfers to a
+single device at a time.
+
+In the case of multiple Peripheral devices attached to different
+Managers, the broadcast and group addressing is not supported by the
+SoundWire specification. Each device must be handled with separate BRA
+streams, possibly in parallel - the links are really independent.
+
+Unsupported features
+--------------------
+
+The Bulk Register Access specification provides a number of
+capabilities that are not supported in known implementations, such as:
+
+  (1) Transfers initiated by a Peripheral Device. The BRA Initiator is
+      always the Manager Device.
+
+  (2) Flow-control capabilities and retransmission based on the
+      'NotReady' header response require extra buffering in the
+      SoundWire IP and are not implemented.
+
+Bi-directional handling
+-----------------------
+
+The BRA protocol can handle writes as well as reads, and in each
+packet the header and footer response are provided by the Peripheral
+Target device. On the Peripheral device, the BRA protocol is handled
+by a single DP0 data port, and at the low-level the bus ownership can
+will change for header/footer response as well as the data transmitted
+during a read.
+
+On the host side, most implementations rely on a Port-like concept,
+with two FIFOs consuming/generating data transfers in parallel
+(Host->Peripheral and Peripheral->Host). The amount of data
+consumed/produced by these FIFOs is not symmetrical, as a result
+hardware typically inserts markers to help software and hardware
+interpret raw data
+
+Each packet will typically have
+
+  (1) a 'Start of Packet' indicator.
+
+  (2) an 'End of Packet' indicator.
+
+  (3) a packet identifier to correlate the data requested and
+      transmitted, and the error status for each frame
+
+Hardware implementations can check errors at the frame level, and
+retry a transfer in case of errors. However, as for the flow-control
+case, this requires extra buffering and intelligence in the
+hardware. The Linux support assumes that the entire transfer is
+cancelled if a single error is detected in one of the responses.
+
+Cadence IP BRA support
+----------------------
+
+Format requirements
+~~~~~~~~~~~~~~~~~~~
+
+The Cadence IP relies on PDI0 for TX and PDI1 for RX. The data needs
+to be formatted with the following conventions:
+
+  (1) all Data is stored in bits 15..0 of the 32-bit PDI FIFOs.
+
+  (2) the start of packet is BIT(31).
+
+  (3) the end of packet is BIT(30).
+
+  (4) A packet ID is stored in bits 19..16. This packet ID is
+      determined by software and is typically a rolling counter.
+
+  (5) Padding shall be inserted as needed so that the Header CRC,
+      Header response, Footer CRC, Footer response are always in
+      Byte0. Padding is inserted by software for writes, and on reads
+      software shall discard the padding added by the hardware.
+
+Example format
+~~~~~~~~~~~~~~
+
+The following table represents the sequence provided to PDI0 for a
+write command followed by a read command.
+
+::
+
+	+---+---+--------+---------------+---------------+
+	+ 1 | 0 | ID = 0 |  WR HDR[1]    |  WR HDR[0]    |
+	+   |   |        |  WR HDR[3]    |  WR HDR[2]    |
+	+   |   |        |  WR HDR[5]    |  WR HDR[4]    |
+	+   |   |        |  pad          |  WR HDR CRC   |
+	+   |   |        |  WR Data[1]   |  WR Data[0]   |
+	+   |   |        |  WR Data[3]   |  WR Data[2]   |
+	+   |   |        |  WR Data[n-2] |  WR Data[n-3] |
+	+   |   |        |  pad          |  WR Data[n-1] |
+	+ 0 | 1 |        |  pad          |  WR Data CRC  |
+	+---+---+--------+---------------+---------------+
+	+ 1 | 0 | ID = 1 |  RD HDR[1]    |  RD HDR[0]    |
+	+   |   |        |  RD HDR[3]    |  RD HDR[2]    |
+	+   |   |        |  RD HDR[5]    |  RD HDR[4]    |
+	+ 0 | 1 |        |  pad          |  RD HDR CRC   |
+	+---+---+--------+---------------+---------------+
+
+
+The table below represents the data received on PDI1 for the same
+write command followed by a read command.
+
+::
+
+	+---+---+--------+---------------+---------------+
+	+ 1 | 0 | ID = 0 |  pad          |  WR Hdr Rsp   |
+	+ 0 | 1 |        |  pad          |  WR Ftr Rsp   |
+	+---+---+--------+---------------+---------------+
+	+ 1 | 0 | ID = 0 |  pad          |  Rd Hdr Rsp   |
+	+   |   |        |  RD Data[1]   |  RD Data[0]   |
+	+   |   |        |  RD Data[3]   |  RD Data[2]   |
+	+   |   |        |  RD HDR[n-2]  |  RD Data[n-3] |
+	+   |   |        |  pad          |  RD Data[n-1] |
+	+   |   |        |  pad          |  RD Data CRC  |
+	+ 0 | 1 |        |  pad          |  RD Ftr Rsp   |
+	+---+---+--------+---------------+---------------+
+
+
+Peripheral/bus interface
+------------------------
+
+Regmap use
+~~~~~~~~~~
+
+Existing codec drivers rely on regmap to download firmware to
+Peripherals, so at a high-level it would seem natural to combine BRA
+and regmap. The regmap layer could check if BRA is available or not,
+and use a regular read-write command channel in the latter case.
+
+However, the regmap layer does not have information on latency or how
+critical a BRA transfer is. It would make more sense to let the codec
+driver make decisions (wait, timeout, fallback to regular
+read/writes).
+
+In addition, the hardware may lose context and ask for the firmware to
+be downloaded again. The firmware is not however a fixed definition,
+the SDCA definition allows the hardware to request an updated firmware
+or a different coefficient table to deal with specific environment
+conditions. In other words, traditional regmap cache management is not
+good enough, the Peripheral driver is required track hardware
+notifications and react accordingly.
+
+Abstraction required
+~~~~~~~~~~~~~~~~~~~~
+
+There are no standard registers or mandatory implementation at the
+Manager level, so the low-level BPT/BRA details must be hidden in
+Manager-specific code. For example the Cadence IP format above is not
+known to the codec drivers.
+
+Likewise, codec drivers should not have to know the frame size. The
+computation of CRC and handling of responses is handled in helpers and
+Manager-specific code.
+
+The host BRA driver may also have restrictions on pages allocated for
+DMA, or other host-DSP communication protocols. The codec driver
+should not be aware of any of these restrictions, since it might be
+reused in combination with different implementations of Manager IPs.
+
+Concurrency between BRA and regular read/write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing 'nread/nwrite' API already relies on a notion of start
+address and number of bytes, so it would be possible to extend this
+API with a 'hint' requesting BPT/BRA be used.
+
+However BRA transfers could be quite long, and the use of a single
+mutex for regular read/write and BRA is a show-stopper. Independent
+operation of the control/command and BRA transfers is a fundamental
+requirement, e.g. to change the volume level with the existing regmap
+interface while downloading firmware.
+
+This places the burden on the codec driver to verify that there is no
+concurrent access to the same address with the command/control
+protocol and the BRA protocol.
+
+In addition, the 'sdw_msg' structure hard-codes support for 16-bit
+addresses and paging registers which are irrelevant for BPT/BRA
+support based on native 32-bit addresses. A separate API with
+'sdw_bpt_msg' makes more sense.
+
+One possible strategy to speed-up all initialization tasks would be to
+start a BRA transfer for firmware download, then deal with all the
+"regular" read/writes in parallel with the command channel, and last
+to wait for the BRA transfers to complete. This would allow for a
+degree of overlap instead of a purely sequential solution. As a
+results, the BRA API must support async transfers and expose a
+separate wait function.
+
+Error handling
+~~~~~~~~~~~~~~
+
+The expected response to a 'bra_message' and follow-up behavior may
+vary:
+
+   (1) A Peripheral driver may want to receive an immediate -EBUSY
+       response if the BRA protocol is not available at a given time.
+
+   (2) A Peripheral driver may want to wait until a timeout for the
+       on-going transfer to be handled
+
+   (3) A Peripheral driver may want to wait until existing BRA
+       transfers complete or deal with BRA as a background task when
+       audio transfers stop. In this case, there would be no timeout,
+       and the operation may not happen if the platform is suspended.
+
+BRA stream model
+----------------
+
+For regular audio transfers, the machine driver exposes a dailink
+connecting CPU DAI(s) and Codec DAI(s).
+
+This model is not required BRA support:
+
+   (1) The SoundWire DAIs are mainly wrappers for SoundWire Data
+       Ports, with possibly some analog or audio conversion
+       capabilities bolted behind the Data Port. In the context of
+       BRA, the DP0 is the destination. DP0 registers are standard and
+       can be programmed blindly without knowing what Peripheral is
+       connected to each link. In addition, if there are multiple
+       Peripherals on a link and some of them do not support DP0, the
+       write commands to program DP0 registers will generate harmless
+       COMMAND_IGNORED responses that will be wired-ORed with
+       responses from Peripherals which support DP0. In other words,
+       the DP0 programming can be done with broadcast commands, and
+       the information on the Target device can be added only in the
+       BRA Header.
+
+   (2) At the CPU level, the DAI concept is not useful for BRA; the
+       machine driver will not create a dailink relying on DP0. The
+       only concept that is needed is the notion of port.
+
+   (3) The stream concept relies on a set of master_rt and slave_rt
+       concepts. All of these entities represent ports and not DAIs.
+
+   (4) With the assumption that a single BRA stream is used per link,
+       that stream can connect master ports as well as all peripheral
+       DP0 ports.
+
+   (5) BRA transfers only make sense in the concept of one
+       Manager/Link, so the BRA stream handling does not rely on the
+       concept of multi-link aggregation allowed by regular DAI links.
+
+Audio DMA support
+-----------------
+
+Some DMAs, such as HDaudio, require an audio format field to be
+set. This format is in turn used to define acceptable bursts. BPT/BRA
+support is not fully compatible with these definitions in that the
+format may vary between read and write commands.
+
+In addition, on Intel HDaudio Intel platforms the DMAs need to be
+programmed with a PCM format matching the bandwidth of the BPT/BRA
+transfer. The format is based on 48kHz 32-bit samples, and the number
+of channels varies to adjust the bandwidth. The notion of channel is
+completely notional since the data is not typical audio
+PCM. Programming channels helps reserve enough bandwidth and adjust
+FIFO sizes to avoid xruns. Note that the quality of service comes as a
+cost. Since all channels need to be present as a sample block, data
+sizes not aligned to 128-bytes are not supported.
+
+BTP/BRA API available to peripheral drivers
+-------------------------------------------
+
+ASoC Peripheral drivers may use
+
+   - sdw_bpt_stream_open(mode)
+
+      This function verifies that the BPT protocol with the
+      'mode'. For now only BRA is accepted as a mode. This function
+      allocates a work buffer internally. This buffer is not exposed
+      to the caller.
+
+     errors:
+         -ENODEV: BPT/BRA is not supported by the Manager.
+
+         -EBUSY: another agent is already using the audio payload for
+          audio transfers. There is no way to predict when the audio
+          streams might stop, this will require the Peripheral driver
+          to fall back to the regular (slow) command channel.
+
+         -EAGAIN: another agent is already transferring data using the
+          BPT/BRA protocol. Since the transfers will typically last
+          10s or 100s of ms, the Peripheral driver may wait and retry
+          later.
+
+    - sdw_bpt_message_send_async(bpt_message)
+
+      This function sends the data using the Manager
+      implementation-defined capabilities (typically DMA or IPC
+      protocol). If the message exceeds the size of the buffer
+      allocated in the 'open' stage, the data will be copied and
+      transmitted in multiple chunks using the buffer. This function
+      cannot be called multiple times to queue transfers, the codec
+      driver needs to wait for completion of the requested transfer.
+
+      errors:
+
+         -ENODEV: BPT/BRA is not supported by the Manager.
+
+	 -EINVAL: no resources available.
+
+    - sdw_bpt_message_wait(timeout)
+
+      This function waits for the entire message provided by the codec
+      driver in the 'send_async' stage. Intermediate status for
+      smaller chunks will not be provided back to the codec driver,
+      only a return code will be provided.
+
+      errors:
+
+         -ENODEV: BPT/BRA is not supported by the Manager.
+
+	 -EINVAL: no transfer queued
+
+         -EIO: some sort of transmisson error happened, typically a
+          bad CRC was detected.
+
+	 -ETIMEDOUT: transfer did not complete
+
+    - sdw_bpt_stream_close()
+
+      This functions releases the buffer allocated in the open stage
+      and decreases the refcounts.
+
+      Note that it's possible to call send_async/message_wait multiple
+      times, it's not required to close/open.
diff --git a/Documentation/driver-api/soundwire/index.rst b/Documentation/driver-api/soundwire/index.rst
index 234911a0db99..8d826fd5781f 100644
--- a/Documentation/driver-api/soundwire/index.rst
+++ b/Documentation/driver-api/soundwire/index.rst
@@ -9,6 +9,7 @@  SoundWire Documentation
    stream
    error_handling
    locking
+   bra
 
 .. only::  subproject and html
 
diff --git a/Documentation/driver-api/soundwire/summary.rst b/Documentation/driver-api/soundwire/summary.rst
index 01dcb954f6d7..260a1c78545e 100644
--- a/Documentation/driver-api/soundwire/summary.rst
+++ b/Documentation/driver-api/soundwire/summary.rst
@@ -187,10 +187,7 @@  reconfigurations.
 Future enhancements to be done
 ==============================
 
- (1) Bulk Register Access (BRA) transfers.
-
-
- (2) Multiple data lane support.
+ (1) Multiple data lane support.
 
 Links
 =====