[0/3] Apple Mailbox Controller support

Message ID	20210907145501.69161-1-sven@svenpeter.dev
Headers	show Return-Path: <devicetree-owner@kernel.org> From: Sven Peter <sven@svenpeter.dev> To: Jassi Brar <jassisinghbrar@gmail.com> Cc: Sven Peter <sven@svenpeter.dev>, Rob Herring <robh+dt@kernel.org>, Mark Kettenis <mark.kettenis@xs4all.nl>, Hector Martin <marcan@marcan.st>, Alyssa Rosenzweig <alyssa@rosenzweig.io>, Mohamed Mediouni <mohamed.mediouni@caramail.com>, Stan Skowronek <stan@corellium.com>, devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org Subject: [PATCH 0/3] Apple Mailbox Controller support Date: Tue, 7 Sep 2021 16:54:58 +0200 Message-Id: <20210907145501.69161-1-sven@svenpeter.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Apple Mailbox Controller support \| expand [0/3] Apple Mailbox Controller support [2/3] dt-bindings: mailbox: Add Apple mailbox bindings

Message ID

20210907145501.69161-1-sven@svenpeter.dev

Headers

From: Sven Peter <sven@svenpeter.dev>
To: Jassi Brar <jassisinghbrar@gmail.com>
Cc: Sven Peter <sven@svenpeter.dev>, Rob Herring <robh+dt@kernel.org>,
	Mark Kettenis <mark.kettenis@xs4all.nl>,
	Hector Martin <marcan@marcan.st>,
	Alyssa Rosenzweig <alyssa@rosenzweig.io>,
	Mohamed Mediouni <mohamed.mediouni@caramail.com>,
	Stan Skowronek <stan@corellium.com>,
	devicetree@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 0/3] Apple Mailbox Controller support
Date: Tue,  7 Sep 2021 16:54:58 +0200
Message-Id: <20210907145501.69161-1-sven@svenpeter.dev>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Apple Mailbox Controller support | expand

Message

Sven Peter Sept. 7, 2021, 2:54 p.m. UTC

Hi,

This series adds support for the mailbox HW found in the Apple M1. These SoCs
have various co-processors controlling different peripherals (NVMe, display
controller, SMC (required for WiFi), Thunderbolt, and probably more
we don't know about yet). All these co-processors communicate with the main CPU
using these mailboxes. These mailboxes transmit 64+32 bit messages, are
backed by a hardware FIFO and have four interrupts (FIFO empty and FIFO not
empty for the transmit and receive FIFO each).

The hardware itself allows to send 64+32 bit message using two hardware
registers. A write to or read from the second register transmits or receives a
message. Usually, the first 64 bit register is used for the message itself and 
8 bits of the second register are used as an endpoint. I originally considered
to have the endpoint exposed as a mailbox-channel, but finally decided against
it: The hardware itself only provides a single channel to the co-processor and
the endpoint bits are only an implementation detail of the firmware. There's
even one co-processor (SEP) which uses 8 bits of the first register as its
endpoint number instead.
There was a similar discussion about the BCM2835 / Raspberry Pi mailboxes
which came to the same conclusion [1].

These mailboxes also have a hardware FIFO which make implementing them with the
current mailbox a bit tricky: There is no "transmission done" interrupt because
most transmissions are "done" immediately. There is only a "transmission fifo
empty" level interrupt. I have instead implemented this by adding a fast-path to
the core mailbox code as a new txready_fifo mode.
The other possibilities (which would not require any changes to the core mailbox
code) are to either use the polling mode or to enable the "tx fifo empty"
interrupt in send_message and then call txready from the irq handler before
disabling it again. I'd like to avoid those though since so far I've never seen
the TX FIFO run full which allows to almost always avoid the context switch when
sending a message. I can easily switch to one of these modes if you prefer to
keep the core code untouched though.


Best,

Sven

Sven Peter (3):
  mailbox: Add txdone_fifo
  dt-bindings: mailbox: Add Apple mailbox bindings
  mailbox: apple: Add driver for Apple mailboxes

 .../bindings/mailbox/apple,mailbox.yaml       |  81 ++++
 MAINTAINERS                                   |   3 +
 drivers/mailbox/Kconfig                       |  12 +
 drivers/mailbox/Makefile                      |   2 +
 drivers/mailbox/apple-mailbox.c               | 432 ++++++++++++++++++
 drivers/mailbox/mailbox.c                     |  66 ++-
 drivers/mailbox/mailbox.h                     |   1 +
 include/linux/apple-mailbox.h                 |  18 +
 include/linux/mailbox_controller.h            |  15 +
 9 files changed, 621 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/mailbox/apple,mailbox.yaml
 create mode 100644 drivers/mailbox/apple-mailbox.c
 create mode 100644 include/linux/apple-mailbox.h

Comments

Alyssa Rosenzweig Sept. 7, 2021, 9:06 p.m. UTC | #1

> > > + * Both the main CPU and the co-processor see the same set of registers but
> > > + * the first FIFO (A2I) is always used to transfer messages from the application
> > > + * processor (us) to the I/O processor and the second one (I2A) for the
> > > + * other direction.
> > 
> > Do we actually know what the copro sees? It seems obvious, but *know*?
> 
> Yes, I know. They see the exact same set of registers. I wouldn't have written
> it like that otherwise.

Ack.

> > > +struct apple_mbox {
> > > +	void __iomem *regs;
> > > +	const struct apple_mbox_hw *hw;
> > > +     ...
> > > +};
> > 
> > This requires a lot of pointer chasing to send/receive messages. Will
> > this become a perf issue in any case? It'd be good to get ballparks on
> > how often these mboxes are used. (For DCP, it doesn't matter, only a few
> > hundred messages per second. Other copros may have different behaviour)
> 
> If this actually becomes a problem I'm happy to fix it but right now
> this feels like premature optimization to me. DCP is going to be the
> worst offender followed by SMC (at most a few per second when it's really
> busy during boot time) and SEP (I've also never seen more than a few per
> second here). The rest of them are mostly quiet after they have booted.

Good enough for me -- it won't matter for DCP, so if it doesn't get any
worse than DCP there's no point in optimizing this.

> > > +static bool apple_mbox_hw_can_send(struct apple_mbox *apple_mbox)
> > > +{
> > > +	u32 mbox_ctrl =
> > > +		readl_relaxed(apple_mbox->regs + apple_mbox->hw->a2i_control);
> > > +
> > > +	return !(mbox_ctrl & apple_mbox->hw->control_full);
> > > +}
> > 
> > If you do the pointer chasing, I suspect you want accessor
> > functions/macros at least to make it less intrusive...
> 
> There's regmap but that can't easily be used here because I need 32bit
> and 64bit writes. I also don't really see the advantage of replacing
> this with some custom functions like e.g.
> 
> 	mbox_ctrl = apple_mbox_hw_readl(apple_mbox, apple_mbox->hw->a2i_control);
> 
> which is almost as long. And if I introduce a separate function just to read the
> control reg this just becomes a lot of boilerplate and harder to follow.
> 
> Or did you have anything else in mind?

I was envisioning a macro:

>	#define apple_mbox_readl(mbox, name) \
>		readl_relaxed(mbox->regs + mbox->hw-> ## name)
>
> 	mbox_ctrl = apple_mbox_readl(apple_mbox, a2i_control);

Now that I've typed it out, however, it seems a bit too magical to
justify the minor space savings. And given you need both 32b and 64b
there would be ~4 such macros which is also a lot of boilerplate.

It's not a huge deal either way but I thought I'd raise it.

> > > +	dev_dbg(apple_mbox->dev, "> TX %016llx %08x\n", msg->msg0, msg->msg1);
> > 
> > Isn't "TX" redundant here?
> 
> Sure, but it also doesn't hurt in a debug message. I can spot the TX easier
> but I'm sure there are people who prefer >.

Fair enough.

> > > +	dev_dbg(apple_mbox->dev, "got FIFO empty IRQ\n");
> > 
> > I realize it's a dev_dbg but this still seems unnecessarily noisy.
> 
> This code path is almost never reached. I've only been able to trigger
> it by forcing send_message to return EBUSY even when it could still
> move messages to the FIFO to test this path.
> This also can't be triggered more often than the TX debug message.
> If this triggers more often there's a bug somewhere that kept the interrupt
> enabled and then the whole machine will hang due an interrupt storm anyway. In
> that case I'd prefer to have this as noisy as possible.

Ah! Sure, makes sense. Is it worth adding a few words to the functions
comments indicating this rarely occurs in good conditions?

> > > +static irqreturn_t apple_mbox_recv_irq(int irq, void *data)
> > > +{
> > > +	struct apple_mbox *apple_mbox = data;
> > > +	struct apple_mbox_msg msg;
> > > +
> > > +	while (apple_mbox_hw_can_recv(apple_mbox)) {
> > > +		apple_mbox_hw_recv(apple_mbox, &msg);
> > > +		mbox_chan_received_data(&apple_mbox->chan, (void *)&msg);
> > > +	}
> > ```
> > 
> > A comment is warranted why this loop is safe and will always terminate,
> > especially given this is an IRQ context. (Ah... can a malicious
> > coprocessor DoS the AP by sending messages faster than the AP can
> > process them? freezing the system since preemption might be disabled
> > here? I suppose thee's no good way to protect against that level of
> > goofy.)
> 
> The only way this can't terminate is if the co-processor tries to stall
> us with messages, but what's your threat model here? If the co-processor wants
> to be evil it usually has many other ways to annoy us (e.g. ANS could just disable
> NVMe, SMC could just trigger a shutdown, etc.)
> 
> I could move this to threaded interrupt context though which would make that a bit
> harder to pull off at the cost of a bit more latency from incoming messages.
> Not sure that's worth it though.

Probably not worth it, no.

> > 
> > > +	 * There's no race if a message comes in between the check in the while
> > > +	 * loop above and the ack below: If a new messages arrives inbetween
> > > +	 * those two the interrupt will just fire again immediately after the
> > > +	 * ack since it's level sensitive.
> > 
> > I don't quite understand the reasoning. Why have IRQ controls at all
> > then on the M3?
> 
> Re-read the two lines just above this one. If the interrupt is not acked here
> it will keep firing even when there are no new messages.
> But it's fine to ack it even when there are message available since it will
> just trigger again immediately.

Got it, thanks.

> > > +	/*
> > > +	 * Only some variants of this mailbox HW provide interrupt control
> > > +	 * at the mailbox level. We therefore need to handle enabling/disabling
> > > +	 * interrupts at the main interrupt controller anyway for hardware that
> > > +	 * doesn't. Just always keep the interrupts we care about enabled at
> > > +	 * the mailbox level so that both hardware revisions behave almost
> > > +	 * the same.
> > > +	 */
> > 
> > Cute. Does macOS do this? Are there any tradeoffs?
> 
> Not sure if macOS does is and I also don't see a reason to care what it
> does exactly. I've verified that this works with the M3 mailboxes.
> I also don't see why there would there be any tradeoffs.
> Do you have anything specific in mind?
> 
> I suspect this HW was used in previous SoCs where all four interrupts
> shared a single line and required this chained interrupt controller
> at the mailbox level. But since they are all separate on the M1 it's
> just a nuisance we have to deal with - especially since the ASC
> variant got rid of the interrupt controls.

Makes sense for a legacy block 👍

Sven Peter Sept. 8, 2021, 3:38 p.m. UTC | #2

On Tue, Sep 7, 2021, at 23:06, Alyssa Rosenzweig wrote:
> > > > + * Both the main CPU and the co-processor see the same set of registers but

> > > > + * the first FIFO (A2I) is always used to transfer messages from the application

> > > > + * processor (us) to the I/O processor and the second one (I2A) for the

> > > > + * other direction.

> > > 

> > > Do we actually know what the copro sees? It seems obvious, but *know*?

> > 

> > Yes, I know. They see the exact same set of registers. I wouldn't have written

> > it like that otherwise.

> 

> Ack.

> 

> > > > +struct apple_mbox {

> > > > +	void __iomem *regs;

> > > > +	const struct apple_mbox_hw *hw;

> > > > +     ...

> > > > +};

> > > 

> > > This requires a lot of pointer chasing to send/receive messages. Will

> > > this become a perf issue in any case? It'd be good to get ballparks on

> > > how often these mboxes are used. (For DCP, it doesn't matter, only a few

> > > hundred messages per second. Other copros may have different behaviour)

> > 

> > If this actually becomes a problem I'm happy to fix it but right now

> > this feels like premature optimization to me. DCP is going to be the

> > worst offender followed by SMC (at most a few per second when it's really

> > busy during boot time) and SEP (I've also never seen more than a few per

> > second here). The rest of them are mostly quiet after they have booted.

> 

> Good enough for me -- it won't matter for DCP, so if it doesn't get any

> worse than DCP there's no point in optimizing this.

> 

> > > > +static bool apple_mbox_hw_can_send(struct apple_mbox *apple_mbox)

> > > > +{

> > > > +	u32 mbox_ctrl =

> > > > +		readl_relaxed(apple_mbox->regs + apple_mbox->hw->a2i_control);

> > > > +

> > > > +	return !(mbox_ctrl & apple_mbox->hw->control_full);

> > > > +}

> > > 

> > > If you do the pointer chasing, I suspect you want accessor

> > > functions/macros at least to make it less intrusive...

> > 

> > There's regmap but that can't easily be used here because I need 32bit

> > and 64bit writes. I also don't really see the advantage of replacing

> > this with some custom functions like e.g.

> > 

> > 	mbox_ctrl = apple_mbox_hw_readl(apple_mbox, apple_mbox->hw->a2i_control);

> > 

> > which is almost as long. And if I introduce a separate function just to read the

> > control reg this just becomes a lot of boilerplate and harder to follow.

> > 

> > Or did you have anything else in mind?

> 

> I was envisioning a macro:

> 

> >	#define apple_mbox_readl(mbox, name) \

> >		readl_relaxed(mbox->regs + mbox->hw-> ## name)

> >

> > 	mbox_ctrl = apple_mbox_readl(apple_mbox, a2i_control);

> 

> Now that I've typed it out, however, it seems a bit too magical to

> justify the minor space savings. And given you need both 32b and 64b

> there would be ~4 such macros which is also a lot of boilerplate.

> 

> It's not a huge deal either way but I thought I'd raise it.


Yeah, I've thought about this as well but this entire hardware is
so simple that we don't gain much from it imho.

> 

> > > > +	dev_dbg(apple_mbox->dev, "> TX %016llx %08x\n", msg->msg0, msg->msg1);

> > > 

> > > Isn't "TX" redundant here?

> > 

> > Sure, but it also doesn't hurt in a debug message. I can spot the TX easier

> > but I'm sure there are people who prefer >.

> 

> Fair enough.

> 

> > > > +	dev_dbg(apple_mbox->dev, "got FIFO empty IRQ\n");

> > > 

> > > I realize it's a dev_dbg but this still seems unnecessarily noisy.

> > 

> > This code path is almost never reached. I've only been able to trigger

> > it by forcing send_message to return EBUSY even when it could still

> > move messages to the FIFO to test this path.

> > This also can't be triggered more often than the TX debug message.

> > If this triggers more often there's a bug somewhere that kept the interrupt

> > enabled and then the whole machine will hang due an interrupt storm anyway. In

> > that case I'd prefer to have this as noisy as possible.

> 

> Ah! Sure, makes sense. Is it worth adding a few words to the functions

> comments indicating this rarely occurs in good conditions?


Sure, I can add a small comment if it makes the code easier to understand.

> 

> > > > +static irqreturn_t apple_mbox_recv_irq(int irq, void *data)

> > > > +{

> > > > +	struct apple_mbox *apple_mbox = data;

> > > > +	struct apple_mbox_msg msg;

> > > > +

> > > > +	while (apple_mbox_hw_can_recv(apple_mbox)) {

> > > > +		apple_mbox_hw_recv(apple_mbox, &msg);

> > > > +		mbox_chan_received_data(&apple_mbox->chan, (void *)&msg);

> > > > +	}

> > > ```

> > > 

> > > A comment is warranted why this loop is safe and will always terminate,

> > > especially given this is an IRQ context. (Ah... can a malicious

> > > coprocessor DoS the AP by sending messages faster than the AP can

> > > process them? freezing the system since preemption might be disabled

> > > here? I suppose thee's no good way to protect against that level of

> > > goofy.)

> > 

> > The only way this can't terminate is if the co-processor tries to stall

> > us with messages, but what's your threat model here? If the co-processor wants

> > to be evil it usually has many other ways to annoy us (e.g. ANS could just disable

> > NVMe, SMC could just trigger a shutdown, etc.)

> > 

> > I could move this to threaded interrupt context though which would make that a bit

> > harder to pull off at the cost of a bit more latency from incoming messages.

> > Not sure that's worth it though.

> 

> Probably not worth it, no.


One small advantage of the threaded interrupt would be that the mailbox clients could
detect the invalid messages and at least get a chance to shut the channel down if
the co-processor did this by accident.
Still not sure if that would actually help much but I guess it won't matter in
the end anyway. Changing this only requires to request a threaded irq in the probe
function so it's also not a big deal either way.


> 

> > > 

> > > > +	 * There's no race if a message comes in between the check in the while

> > > > +	 * loop above and the ack below: If a new messages arrives inbetween

> > > > +	 * those two the interrupt will just fire again immediately after the

> > > > +	 * ack since it's level sensitive.

> > > 

> > > I don't quite understand the reasoning. Why have IRQ controls at all

> > > then on the M3?

> > 

> > Re-read the two lines just above this one. If the interrupt is not acked here

> > it will keep firing even when there are no new messages.

> > But it's fine to ack it even when there are message available since it will

> > just trigger again immediately.

> 

> Got it, thanks.

> 

> > > > +	/*

> > > > +	 * Only some variants of this mailbox HW provide interrupt control

> > > > +	 * at the mailbox level. We therefore need to handle enabling/disabling

> > > > +	 * interrupts at the main interrupt controller anyway for hardware that

> > > > +	 * doesn't. Just always keep the interrupts we care about enabled at

> > > > +	 * the mailbox level so that both hardware revisions behave almost

> > > > +	 * the same.

> > > > +	 */

> > > 

> > > Cute. Does macOS do this? Are there any tradeoffs?

> > 

> > Not sure if macOS does is and I also don't see a reason to care what it

> > does exactly. I've verified that this works with the M3 mailboxes.

> > I also don't see why there would there be any tradeoffs.

> > Do you have anything specific in mind?

> > 

> > I suspect this HW was used in previous SoCs where all four interrupts

> > shared a single line and required this chained interrupt controller

> > at the mailbox level. But since they are all separate on the M1 it's

> > just a nuisance we have to deal with - especially since the ASC

> > variant got rid of the interrupt controls.

> 

> Makes sense for a legacy block 👍

> 



Best,


Sven

Jassi Brar Sept. 8, 2021, 8:48 p.m. UTC | #3

On Tue, Sep 7, 2021 at 9:55 AM Sven Peter <sven@svenpeter.dev> wrote:
>

> Hi,

>

> This series adds support for the mailbox HW found in the Apple M1. These SoCs

> have various co-processors controlling different peripherals (NVMe, display

> controller, SMC (required for WiFi), Thunderbolt, and probably more

> we don't know about yet). All these co-processors communicate with the main CPU

> using these mailboxes. These mailboxes transmit 64+32 bit messages, are

> backed by a hardware FIFO and have four interrupts (FIFO empty and FIFO not

> empty for the transmit and receive FIFO each).

>

> The hardware itself allows to send 64+32 bit message using two hardware

> registers. A write to or read from the second register transmits or receives a

> message. Usually, the first 64 bit register is used for the message itself and

> 8 bits of the second register are used as an endpoint. I originally considered

> to have the endpoint exposed as a mailbox-channel, but finally decided against

> it: The hardware itself only provides a single channel to the co-processor and

> the endpoint bits are only an implementation detail of the firmware. There's

> even one co-processor (SEP) which uses 8 bits of the first register as its

> endpoint number instead.

> There was a similar discussion about the BCM2835 / Raspberry Pi mailboxes

> which came to the same conclusion [1].

>

> These mailboxes also have a hardware FIFO which make implementing them with the

> current mailbox a bit tricky: There is no "transmission done" interrupt because

> most transmissions are "done" immediately. There is only a "transmission fifo

> empty" level interrupt. I have instead implemented this by adding a fast-path to

> the core mailbox code as a new txready_fifo mode.

> The other possibilities (which would not require any changes to the core mailbox

> code) are to either use the polling mode or to enable the "tx fifo empty"

> interrupt in send_message and then call txready from the irq handler before

> disabling it again. I'd like to avoid those though since so far I've never seen

> the TX FIFO run full which allows to almost always avoid the context switch when

> sending a message. I can easily switch to one of these modes if you prefer to

> keep the core code untouched though.

>

Yes, please keep the api unchanged.
Let us please not dig our own tunnels when the existing ways serve the purpose.

Thanks.

Sven Peter Sept. 9, 2021, 10:44 a.m. UTC | #4

On Wed, Sep 8, 2021, at 22:48, Jassi Brar wrote:
> On Tue, Sep 7, 2021 at 9:55 AM Sven Peter <sven@svenpeter.dev> wrote:

> >

> > Hi,

> >

> > This series adds support for the mailbox HW found in the Apple M1. These SoCs

> > have various co-processors controlling different peripherals (NVMe, display

> > controller, SMC (required for WiFi), Thunderbolt, and probably more

> > we don't know about yet). All these co-processors communicate with the main CPU

> > using these mailboxes. These mailboxes transmit 64+32 bit messages, are

> > backed by a hardware FIFO and have four interrupts (FIFO empty and FIFO not

> > empty for the transmit and receive FIFO each).

> >

> > The hardware itself allows to send 64+32 bit message using two hardware

> > registers. A write to or read from the second register transmits or receives a

> > message. Usually, the first 64 bit register is used for the message itself and

> > 8 bits of the second register are used as an endpoint. I originally considered

> > to have the endpoint exposed as a mailbox-channel, but finally decided against

> > it: The hardware itself only provides a single channel to the co-processor and

> > the endpoint bits are only an implementation detail of the firmware. There's

> > even one co-processor (SEP) which uses 8 bits of the first register as its

> > endpoint number instead.

> > There was a similar discussion about the BCM2835 / Raspberry Pi mailboxes

> > which came to the same conclusion [1].

> >

> > These mailboxes also have a hardware FIFO which make implementing them with the

> > current mailbox a bit tricky: There is no "transmission done" interrupt because

> > most transmissions are "done" immediately. There is only a "transmission fifo

> > empty" level interrupt. I have instead implemented this by adding a fast-path to

> > the core mailbox code as a new txready_fifo mode.

> > The other possibilities (which would not require any changes to the core mailbox

> > code) are to either use the polling mode or to enable the "tx fifo empty"

> > interrupt in send_message and then call txready from the irq handler before

> > disabling it again. I'd like to avoid those though since so far I've never seen

> > the TX FIFO run full which allows to almost always avoid the context switch when

> > sending a message. I can easily switch to one of these modes if you prefer to

> > keep the core code untouched though.

> >

> Yes, please keep the api unchanged.

> Let us please not dig our own tunnels when the existing ways serve the purpose.

> 


Ok, I'll use txdone_irq for v2 then and just ignore the HW FIFO.

Thanks,

Sven