[RFC,net-next,0/8] ethtool: Add ability to control transceiver modules

Message ID	20210809102152.719961-1-idosch@idosch.org
Headers	show Return-Path: <netdev-owner@kernel.org> From: Ido Schimmel <idosch@idosch.org> To: netdev@vger.kernel.org Cc: davem@davemloft.net, kuba@kernel.org, andrew@lunn.ch, mkubecek@suse.cz, pali@kernel.org, vadimp@nvidia.com, mlxsw@nvidia.com, Ido Schimmel <idosch@nvidia.com> Subject: [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Date: Mon, 9 Aug 2021 13:21:44 +0300 Message-Id: <20210809102152.719961-1-idosch@idosch.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	ethtool: Add ability to control transceiver modules \| expand [RFC,net-next,0/8] ethtool: Add ability to control transceiver modules [RFC,net-next,1/8] ethtool: Add ability to control transceiver modules' low power mode [RFC,net-next,2/8] ethtool: Add ability to reset transceiver modules [RFC,net-next,3/8] mlxsw: reg: Add fields to PMAOS register [RFC,net-next,4/8] mlxsw: Make PMAOS pack function more generic [RFC,net-next,5/8] mlxsw: reg: Add Port Module Memory Map Properties register [RFC,net-next,6/8] mlxsw: reg: Add Management Cable IO and Notifications register [RFC,net-next,7/8] mlxsw: Add ability to control transceiver modules' low power mode [RFC,net-next,8/8] mlxsw: Add ability to reset transceiver modules

Ido Schimmel Aug. 9, 2021, 10:21 a.m. UTC

From: Ido Schimmel <idosch@nvidia.com>

This patchset extends the ethtool netlink API to allow user space to
control transceiver modules. Two specific APIs are added, but the plan
is to extend the interface with more APIs in the future (see "Future
plans").

This submission is a complete rework of a previous submission [1] that
tried to achieve the same goal by allowing user space to write to the
EEPROMs of these modules. It was rejected as it could have enabled user
space binary blob drivers.

However, the main issue is that by directly writing to some pages of
these EEPROMs, we are interfering with the entity that is controlling
the modules (kernel / device firmware). In addition, some functionality
cannot be implemented solely by writing to the EEPROM, as it requires
the assertion / de-assertion of hardware signals (e.g., "ResetL" pin in
SFF-8636).

Motivation
==========

The kernel can currently dump the contents of module EEPROMs to user
space via the ethtool legacy ioctl API or the new netlink API. These
dumps can then be parsed by ethtool(8) according to the specification
that defines the memory map of the EEPROM. For example, SFF-8636 [2] for
QSFP and CMIS [3] for QSFP-DD.

In addition to read-only elements, these specifications also define
writeable elements that can be used to control the behavior of the
module. For example, controlling whether the module is put in low or
high power mode to limit its power consumption.

The CMIS specification even defines a message exchange mechanism (CDB,
Command Data Block) on top of the module's memory map. This allows the
host to send various commands to the module. For example, to update its
firmware.

Implementation
==============

The ethtool netlink API is extended with two new messages,
'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that allow user
space to set and get transceiver module parameters. Specifically, the
'ETHTOOL_A_MODULE_LOW_POWER_ENABLED' attribute allows user space to
control the low power mode of the module in order to limit its power
consumption. See detailed description in patch #1.

In addition, patch #2 adds the 'ETHTOOL_MSG_MODULE_RESET_ACT' message
(with a corresponding notification) that allows user space to reset the
module in order to get out of fault state.

The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).

The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the kernel.

Testing and introspection
=========================

See detailed description in patches #1 and #2.

Patchset overview
=================

Patch #1 adds the initial infrastructure in ethtool along with the
ability to control transceiver modules' low power mode.

Patch #2 adds the ability to reset transceiver modules.

Patches #3-#6 add required device registers in mlxsw.

Patch #7 implements in mlxsw the ethtool operations added in patch #1.

Patch #8 implements in mlxsw the ethtool operation added in patch #2.

Future plans
============

* Extend 'ETHTOOL_MSG_MODULE_SET' to control Tx output among other
attributes.

* Add new ethtool message(s) to update firmware on transceiver modules.

* Extend ethtool(8) to parse more diagnostic information from CMIS
modules. No kernel changes required.

[1] https://lore.kernel.org/netdev/20210623075925.2610908-1-idosch@idosch.org/
[2] https://members.snia.org/document/dl/26418
[3] http://www.qsfp-dd.com/wp-content/uploads/2021/05/CMIS5p0.pdf

Ido Schimmel (8):
  ethtool: Add ability to control transceiver modules' low power mode
  ethtool: Add ability to reset transceiver modules
  mlxsw: reg: Add fields to PMAOS register
  mlxsw: Make PMAOS pack function more generic
  mlxsw: reg: Add Port Module Memory Map Properties register
  mlxsw: reg: Add Management Cable IO and Notifications register
  mlxsw: Add ability to control transceiver modules' low power mode
  mlxsw: Add ability to reset transceiver modules

 Documentation/networking/ethtool-netlink.rst  |  83 +++++-
 .../net/ethernet/mellanox/mlxsw/core_env.c    | 232 ++++++++++++++-
 .../net/ethernet/mellanox/mlxsw/core_env.h    |  11 +
 drivers/net/ethernet/mellanox/mlxsw/minimal.c |  34 +++
 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 140 +++++++++-
 .../mellanox/mlxsw/spectrum_ethtool.c         |  68 +++++
 include/linux/ethtool.h                       |  12 +
 include/uapi/linux/ethtool_netlink.h          |  18 ++
 net/ethtool/Makefile                          |   2 +-
 net/ethtool/module.c                          | 264 ++++++++++++++++++
 net/ethtool/netlink.c                         |  26 ++
 net/ethtool/netlink.h                         |   6 +
 12 files changed, 887 insertions(+), 9 deletions(-)
 create mode 100644 net/ethtool/module.c

Andrew Lunn Aug. 9, 2021, 2:28 p.m. UTC | #1

On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
> 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
> modules parameters and retrieve their status.

Hi Ido

I've not read all the patchset yet, but i like the general direction.

> The first parameter to control is the low power mode of the module. It
> is only relevant for paged memory modules, as flat memory modules always
> operate in low power mode.
> 
> When a paged memory module is in low power mode, its power consumption
> is reduced to the minimum, the management interface towards the host is
> available and the data path is deactivated.
> 
> User space can choose to put modules that are not currently in use in
> low power mode and transition them to high power mode before putting the
> associated ports administratively up.
> 
> Transitioning into low power mode means loss of carrier, so error is
> returned when the netdev is administratively up.

However, i don't get this use case. With copper PHYs, putting the link
administratively down results in a call into phylib and into the
driver to down the link. This effectively puts the PHY into a low
power mode. The management interface, as defined by C22 and C45 remain
available, but the data path is disabled. For a 1G PHY, this can save
a few watts.

For SFPs managed by phylink and the kernal SFP driver, the exact same
happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus
still works, but the data path is disabled.

So i would expect a driver using firmware, not Linux code to manage
SFPs, to just do this on link down. Why do we need user space
involved?

    Andrew

Ido Schimmel Aug. 10, 2021, 7:26 a.m. UTC | #2

On Mon, Aug 09, 2021 at 04:28:32PM +0200, Andrew Lunn wrote:
> On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote:

> > From: Ido Schimmel <idosch@nvidia.com>

> > 

> > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and

> > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver

> > modules parameters and retrieve their status.

> 

> Hi Ido

> 

> I've not read all the patchset yet, but i like the general direction.

> 

> > The first parameter to control is the low power mode of the module. It

> > is only relevant for paged memory modules, as flat memory modules always

> > operate in low power mode.

> > 

> > When a paged memory module is in low power mode, its power consumption

> > is reduced to the minimum, the management interface towards the host is

> > available and the data path is deactivated.

> > 

> > User space can choose to put modules that are not currently in use in

> > low power mode and transition them to high power mode before putting the

> > associated ports administratively up.

> > 

> > Transitioning into low power mode means loss of carrier, so error is

> > returned when the netdev is administratively up.

> 

> However, i don't get this use case. With copper PHYs, putting the link

> administratively down results in a call into phylib and into the

> driver to down the link. This effectively puts the PHY into a low

> power mode. The management interface, as defined by C22 and C45 remain

> available, but the data path is disabled. For a 1G PHY, this can save

> a few watts.

> 

> For SFPs managed by phylink and the kernal SFP driver, the exact same

> happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus

> still works, but the data path is disabled.

> 

> So i would expect a driver using firmware, not Linux code to manage

> SFPs, to just do this on link down. Why do we need user space

> involved?


The transition from low power to high power can take a few seconds with
QSFP/QSFP-DD and it's likely to only get longer with future / more
complex modules. Therefore, to reduce link-up time, the firmware
automatically transitions modules to high power mode.

There is obviously a trade-off here between power consumption and
link-up time. My understanding is that Mellanox is not the only vendor
favoring shorter link-up times as users have the ability to control the
low power mode of the modules in other implementations.

Regarding "why do we need user space involved?", by default, it does not
need to be involved (the system works without this API), but if it wants
to reduce the power consumption by setting unused modules to low power
mode, then it will need to use this API.

Ido Schimmel Aug. 10, 2021, 1:05 p.m. UTC | #3

On Mon, Aug 09, 2021 at 09:13:47PM +0200, Andrew Lunn wrote:
> On Mon, Aug 09, 2021 at 01:21:46PM +0300, Ido Schimmel wrote:

> > From: Ido Schimmel <idosch@nvidia.com>

> > 

> > Add a new ethtool message, 'ETHTOOL_MSG_MODULE_RESET_ACT', which allows

> > user space to request a reset of transceiver modules. A successful reset

> > results in a notification being emitted to user space in the form of a

> > 'ETHTOOL_MSG_MODULE_RESET_NTF' message.

> > 

> > Reset can be performed by either asserting the relevant hardware signal

> > ("Reset" in CMIS / "ResetL" in SFF-8636) or by writing to the relevant

> > reset bit in the module's EEPROM (page 00h, byte 26, bit 3 in CMIS /

> > page 00h, byte 93, bit 7 in SFF-8636).

> > 

> > Reset is useful in order to allow a module to transition out of a fault

> > state. From section 6.3.2.12 in CMIS 5.0: "Except for a power cycle, the

> > only exit path from the ModuleFault state is to perform a module reset

> > by taking an action that causes the ResetS transition signal to become

> > TRUE (see Table 6-11)".

> > 

> > To avoid changes to the operational state of the device, reset can only

> > be performed when the device is administratively down.

> > 

> > Example usage:

> > 

> >  # ethtool --reset-module swp11

> >  netlink error: Cannot reset module when port is administratively up

> >  netlink error: Invalid argument

> > 

> >  # ip link set dev swp11 down

> > 

> >  # ethtool --reset-module swp11

> > 

> > Monitor notifications:

> > 

> >  $ ethtool --monitor

> >  listening...

> > 

> >  Module reset done for swp11

> 

> Again, i'm wondering, why is user space doing the reset? Can you think

> of any other piece of hardware where Linux relies on user space

> performing a reset before the kernel can properly use it?

> 

> How long does a reset take? Table 10-1 says the reset pulse must be

> 10uS and table 10-2 says the reset should not take longer than

> 2000ms.


Takes about 1.5ms to get an ACK on the reset request and another few
seconds to ensure module is in a valid operational state (will remove
RTNL in next version).

> So maybe reset it on ifup if it is in a bad state?


We can have multiple ports (split) using the same module and in CMIS
each data path is controlled by a different state machine. Given the
complexity of these modules and possible faults, it is possible to
imagine a situation in which a few ports are fine and the rest are
unable to obtain a carrier.

Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't
affect other split ports (e.g., swp1s1). With the dedicated reset
command we have the ability to enforce all the required restrictions
from the start instead of changing the behavior of existing commands.

> I assume the driver/firmware is monitoring the SFP and if it goes into

> a state which requires a reset it indicates carrier down? Wasn't there

> some patches which added link down reasons? It would make sense to add

> enum ethtool_link_ext_substate_sfp_fault? You can then use ethtool to

> see what state the module is in, and a down/ip should reset it?


I will look into extending the interface with more reasons and parse the
CMIS ModuleFaultCause (see table 8-15) in ethtool(8).

Andrew Lunn Aug. 10, 2021, 1:52 p.m. UTC | #4

> The transition from low power to high power can take a few seconds with

> QSFP/QSFP-DD and it's likely to only get longer with future / more

> complex modules. Therefore, to reduce link-up time, the firmware

> automatically transitions modules to high power mode.

> 

> There is obviously a trade-off here between power consumption and

> link-up time. My understanding is that Mellanox is not the only vendor

> favoring shorter link-up times as users have the ability to control the

> low power mode of the modules in other implementations.

> 

> Regarding "why do we need user space involved?", by default, it does not

> need to be involved (the system works without this API), but if it wants

> to reduce the power consumption by setting unused modules to low power

> mode, then it will need to use this API.


O.K. Thanks for the better explanation. Some of this should go into
the commit message.

I suggest it gets a different name and semantics, to avoid
confusion. I think we should consider this the default power mode for
when the link is administratively down, rather than direct control
over the modules power mode. The driver should transition the module
to this setting on link down, be it high power or low power. That
saves a lot of complexity, since i assume you currently need a udev
script or something which sets it to low power mode on link down,
where as you can avoid this be configuring the default and let the
driver do it.

I also wonder if a hierarchy is needed? You can set the default for
the switch, and then override is per module? I _guess_ most users will
decide at a switch level they want to save power and pay the penalty
over longer link up times. But then we have the question, is it an
ethtool option, or a devlink parameter?

	Andrew

Jakub Kicinski Aug. 10, 2021, 1:54 p.m. UTC | #5

On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote:
> > Again, i'm wondering, why is user space doing the reset? Can you think

> > of any other piece of hardware where Linux relies on user space

> > performing a reset before the kernel can properly use it?

> > 

> > How long does a reset take? Table 10-1 says the reset pulse must be

> > 10uS and table 10-2 says the reset should not take longer than

> > 2000ms.  

> 

> Takes about 1.5ms to get an ACK on the reset request and another few

> seconds to ensure module is in a valid operational state (will remove

> RTNL in next version).


Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink
locking was much complicated by the unclear locking rules. Can we keep
ethtool simple and put this functionality in a different API or make
the reset async?

> > So maybe reset it on ifup if it is in a bad state?  

> 

> We can have multiple ports (split) using the same module and in CMIS

> each data path is controlled by a different state machine. Given the

> complexity of these modules and possible faults, it is possible to

> imagine a situation in which a few ports are fine and the rest are

> unable to obtain a carrier.

> 

> Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't

> affect other split ports (e.g., swp1s1). With the dedicated reset

> command we have the ability to enforce all the required restrictions

> from the start instead of changing the behavior of existing commands.


Sounds similar to what ethtool --reset was trying to address (upper
16 bits). Could we reuse that? Whether its a SFP or other part of the
port being reset may not be entirely important to the user, so perhaps
it's not a bad idea to abstract that away and stick to "reset levels"?

Jakub Kicinski Aug. 10, 2021, 1:59 p.m. UTC | #6

On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > The transition from low power to high power can take a few seconds with

> > QSFP/QSFP-DD and it's likely to only get longer with future / more

> > complex modules. Therefore, to reduce link-up time, the firmware

> > automatically transitions modules to high power mode.

> > 

> > There is obviously a trade-off here between power consumption and

> > link-up time. My understanding is that Mellanox is not the only vendor

> > favoring shorter link-up times as users have the ability to control the

> > low power mode of the modules in other implementations.

> > 

> > Regarding "why do we need user space involved?", by default, it does not

> > need to be involved (the system works without this API), but if it wants

> > to reduce the power consumption by setting unused modules to low power

> > mode, then it will need to use this API.  

> 

> O.K. Thanks for the better explanation. Some of this should go into

> the commit message.

> 

> I suggest it gets a different name and semantics, to avoid

> confusion. I think we should consider this the default power mode for

> when the link is administratively down, rather than direct control

> over the modules power mode. The driver should transition the module

> to this setting on link down, be it high power or low power. That

> saves a lot of complexity, since i assume you currently need a udev

> script or something which sets it to low power mode on link down,

> where as you can avoid this be configuring the default and let the

> driver do it.


Good point. And actually NICs have similar knobs, exposed via ethtool
priv flags today. Intel NICs for example. Maybe we should create a
"really power the port down policy" API?

Jake do you know what the use cases for Intel are? Are they SFP, MAC,
or NC-SI related?

> I also wonder if a hierarchy is needed? You can set the default for

> the switch, and then override is per module? I _guess_ most users will

> decide at a switch level they want to save power and pay the penalty

> over longer link up times. But then we have the question, is it an

> ethtool option, or a devlink parameter?

Ido Schimmel Aug. 10, 2021, 6:15 p.m. UTC | #7

On Tue, Aug 10, 2021 at 06:54:23AM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote:

> > > Again, i'm wondering, why is user space doing the reset? Can you think

> > > of any other piece of hardware where Linux relies on user space

> > > performing a reset before the kernel can properly use it?

> > > 

> > > How long does a reset take? Table 10-1 says the reset pulse must be

> > > 10uS and table 10-2 says the reset should not take longer than

> > > 2000ms.  

> > 

> > Takes about 1.5ms to get an ACK on the reset request and another few

> > seconds to ensure module is in a valid operational state (will remove

> > RTNL in next version).

> 

> Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink

> locking was much complicated by the unclear locking rules. Can we keep

> ethtool simple and put this functionality in a different API or make

> the reset async?


I thought there are already RTNL-lock-less ethtool ops, but maybe I
imagined it. Given that we also want to support firmware update on
modules and that user space might want to update several modules
simultaneously, do you have a suggestion on how to handle it from
locking perspective? The ethtool netlink backend has parallel ops, but
RTNL is a problem. Firmware flashing is currently synchronous in both
ethtool and devlink, but the latter does not hold RTNL.

> 

> > > So maybe reset it on ifup if it is in a bad state?  

> > 

> > We can have multiple ports (split) using the same module and in CMIS

> > each data path is controlled by a different state machine. Given the

> > complexity of these modules and possible faults, it is possible to

> > imagine a situation in which a few ports are fine and the rest are

> > unable to obtain a carrier.

> > 

> > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't

> > affect other split ports (e.g., swp1s1). With the dedicated reset

> > command we have the ability to enforce all the required restrictions

> > from the start instead of changing the behavior of existing commands.

> 

> Sounds similar to what ethtool --reset was trying to address (upper

> 16 bits). Could we reuse that? Whether its a SFP or other part of the

> port being reset may not be entirely important to the user, so perhaps

> it's not a bad idea to abstract that away and stick to "reset levels"?


Wasn't aware of this API. Looks like it is only implemented by a few
drivers, but man page says "phy    Transceiver/PHY", so I think we can
reuse it.

What do you mean by "reset levels"? The split between shared /
dedicated?

Just to make sure I understand, you suggest the following semantics?

# ethtool --reset swp1s0 phy
Error since module is used by several ports (split)

# ethtool --reset swp1s0 phy-shared
OK

# ethtool --reset swp1 phy
OK (no split)

# ethtool --reset swp1 phy-shared
OK

Andrew Lunn Aug. 10, 2021, 6:58 p.m. UTC | #8

> I thought there are already RTNL-lock-less ethtool ops, but maybe I

> imagined it. Given that we also want to support firmware update on

> modules and that user space might want to update several modules

> simultaneously, do you have a suggestion on how to handle it from

> locking perspective?


I had a similar problem for Ethernet cable testing. It takes a few
seconds to perform a cable test, and you don't want to hold the RTNL
for that, if you can help it. I made changes to the PHY state machine,
so it has an additional state, indicating a cable test is in
operation. The ethtool netlink op simply starts the cable test and
then returns. Once the cable test is complete, it async reports the
results to user space.

So for module upgrade, you probably need to add a per module state
machine. Changes to the state need to hold RTNL, but you can release
it between state changes.

   Andrew

Jakub Kicinski Aug. 10, 2021, 7 p.m. UTC | #9

On Tue, 10 Aug 2021 21:15:45 +0300 Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:54:23AM -0700, Jakub Kicinski wrote:

> > On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote:  

> > > Takes about 1.5ms to get an ACK on the reset request and another few

> > > seconds to ensure module is in a valid operational state (will remove

> > > RTNL in next version).  

> > 

> > Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink

> > locking was much complicated by the unclear locking rules. Can we keep

> > ethtool simple and put this functionality in a different API or make

> > the reset async?  

> 

> I thought there are already RTNL-lock-less ethtool ops, but maybe I

> imagined it. Given that we also want to support firmware update on

> modules and that user space might want to update several modules

> simultaneously, do you have a suggestion on how to handle it from

> locking perspective?


Hm, flashing is harder than reset. We can't unbind the driver while
it's in progress. I don't have any ready solution in mind, but I'd 
like to make sure the locking is clear and hard to get wrong. Maybe 
we could have a mix of ops, one called for "preparing" the flashing
called under rtnl and another for "commit" with "unlocked" in the name.
Drivers which don't want to deal with dropping rtnl lock can just do
everything in the first stage? Perhaps Andrew has better ideas, I'm
just spit-balling. Presumably there are already locks at play, locks
we would have to take in the case where Linux manages the PHY. Maybe
they dictate an architecture?

> The ethtool netlink backend has parallel ops, but

> RTNL is a problem. Firmware flashing is currently synchronous in both

> ethtool and devlink, but the latter does not hold RTNL.


Yeah, drivers dropping rtnl_lock mid way thru the ethtool flashing op
was one of my main motivations for moving it into devlink.

> > > We can have multiple ports (split) using the same module and in CMIS

> > > each data path is controlled by a different state machine. Given the

> > > complexity of these modules and possible faults, it is possible to

> > > imagine a situation in which a few ports are fine and the rest are

> > > unable to obtain a carrier.

> > > 

> > > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't

> > > affect other split ports (e.g., swp1s1). With the dedicated reset

> > > command we have the ability to enforce all the required restrictions

> > > from the start instead of changing the behavior of existing commands.  

> > 

> > Sounds similar to what ethtool --reset was trying to address (upper

> > 16 bits). Could we reuse that? Whether its a SFP or other part of the

> > port being reset may not be entirely important to the user, so perhaps

> > it's not a bad idea to abstract that away and stick to "reset levels"?  

> 

> Wasn't aware of this API. Looks like it is only implemented by a few

> drivers, but man page says "phy    Transceiver/PHY", so I think we can

> reuse it.

> 

> What do you mean by "reset levels"? The split between shared /

> dedicated?


Indeed.

> Just to make sure I understand, you suggest the following semantics?

> 

> # ethtool --reset swp1s0 phy

> Error since module is used by several ports (split)

> 

> # ethtool --reset swp1s0 phy-shared

> OK

> 

> # ethtool --reset swp1 phy

> OK (no split)

> 

> # ethtool --reset swp1 phy-shared

> OK


Exactly.

Andrew Lunn Aug. 10, 2021, 7:28 p.m. UTC | #10

> Hm, flashing is harder than reset. We can't unbind the driver while

> it's in progress. I don't have any ready solution in mind, but I'd 

> like to make sure the locking is clear and hard to get wrong. Maybe 

> we could have a mix of ops, one called for "preparing" the flashing

> called under rtnl and another for "commit" with "unlocked" in the name.

> Drivers which don't want to deal with dropping rtnl lock can just do

> everything in the first stage? Perhaps Andrew has better ideas, I'm

> just spit-balling. Presumably there are already locks at play, locks

> we would have to take in the case where Linux manages the PHY. Maybe

> they dictate an architecture?


I don't think the way linux manages PHYs dictates the
architecture. PHY cable test requires that the link is
administratively up, so the PHY state machine is in play. It
transitions into a testing state when cable test is started, and when
the test is finished, it resets the PHY to put it back into running
state. If you down the interface while the cable test is running, it
aborts the cable test, and then downs the PHY.

Flashing firmware is a bit different. You need to ensure the interface
is down. And i guess that gets interesting with split modules. You
really should not abort an upgrade because the user wants to up the
interface. So -EBUSY to open() seems like the best option, based on
the state of the SFP state machine.

I suspect you are going to need a kernel thread to do the real
work. So your "prepare" netlink op would pass the name of the firmware
file. Some basic validation would be performed, that all the needed
interfaces are down etc, and then the netlink OP would return. The
thread then uses request_firmware() to get access to the firmware, and
program it. Once complete, or on error, it can async notify user space
that it is sorry, your module is toast, or firmware upgrade was
successful.

This is just throwing out ideas...

     Andrew

Ido Schimmel Aug. 10, 2021, 8:46 p.m. UTC | #11

On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:

> > > The transition from low power to high power can take a few seconds with

> > > QSFP/QSFP-DD and it's likely to only get longer with future / more

> > > complex modules. Therefore, to reduce link-up time, the firmware

> > > automatically transitions modules to high power mode.

> > > 

> > > There is obviously a trade-off here between power consumption and

> > > link-up time. My understanding is that Mellanox is not the only vendor

> > > favoring shorter link-up times as users have the ability to control the

> > > low power mode of the modules in other implementations.

> > > 

> > > Regarding "why do we need user space involved?", by default, it does not

> > > need to be involved (the system works without this API), but if it wants

> > > to reduce the power consumption by setting unused modules to low power

> > > mode, then it will need to use this API.  

> > 

> > O.K. Thanks for the better explanation. Some of this should go into

> > the commit message.

> > 

> > I suggest it gets a different name and semantics, to avoid

> > confusion. I think we should consider this the default power mode for

> > when the link is administratively down, rather than direct control

> > over the modules power mode. The driver should transition the module

> > to this setting on link down, be it high power or low power. That

> > saves a lot of complexity, since i assume you currently need a udev

> > script or something which sets it to low power mode on link down,

> > where as you can avoid this be configuring the default and let the

> > driver do it.

> 

> Good point. And actually NICs have similar knobs, exposed via ethtool

> priv flags today. Intel NICs for example. Maybe we should create a

> "really power the port down policy" API?

See below about Intel. I'm not sure it's the same thing...

I'm against adding a vague "really power the port down policy" API. The
API proposed in the patch is well-defined, its implementation is
documented in standards, its implications are clear and we offer APIs
that give user space full observability into its operation.

A vague API means that it is going to be abused and user space will get
different results over different implementations. After reading the
*commit messages* about the private flags, I'm not sure what the flags
really do, what is their true motivation, implications or how do I get
observability into their operation. I'm not too hopeful about the user
documentation.

Also, like I mentioned in the cover letter, given the complexity of
these modules and as they become more common, it is likely that we will
need to extend the API to control more parameters and expose more
diagnostic information. I would really like to keep it clean and
contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
different APIs.

> 

> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

> or NC-SI related?

I went through all the Intel drivers that implement these operations and
I believe you are talking about these commits:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

There isn't too much information about the motivation, but maybe it has
something to do with multi-host controllers where you want to prevent
one host from taking the physical link down for all the other hosts
sharing it? I remember such issues with mlx5.

> 

> > I also wonder if a hierarchy is needed? You can set the default for

> > the switch, and then override is per module? I _guess_ most users will

> > decide at a switch level they want to save power and pay the penalty

> > over longer link up times. But then we have the question, is it an

> > ethtool option, or a devlink parameter?

Ido Schimmel Aug. 10, 2021, 8:50 p.m. UTC | #12

On Tue, Aug 10, 2021 at 09:28:08PM +0200, Andrew Lunn wrote:
> > Hm, flashing is harder than reset. We can't unbind the driver while

> > it's in progress. I don't have any ready solution in mind, but I'd 

> > like to make sure the locking is clear and hard to get wrong. Maybe 

> > we could have a mix of ops, one called for "preparing" the flashing

> > called under rtnl and another for "commit" with "unlocked" in the name.

> > Drivers which don't want to deal with dropping rtnl lock can just do

> > everything in the first stage? Perhaps Andrew has better ideas, I'm

> > just spit-balling. Presumably there are already locks at play, locks

> > we would have to take in the case where Linux manages the PHY. Maybe

> > they dictate an architecture?

> 

> I don't think the way linux manages PHYs dictates the

> architecture. PHY cable test requires that the link is

> administratively up, so the PHY state machine is in play. It

> transitions into a testing state when cable test is started, and when

> the test is finished, it resets the PHY to put it back into running

> state. If you down the interface while the cable test is running, it

> aborts the cable test, and then downs the PHY.

> 

> Flashing firmware is a bit different. You need to ensure the interface

> is down. And i guess that gets interesting with split modules. You

> really should not abort an upgrade because the user wants to up the

> interface. So -EBUSY to open() seems like the best option, based on

> the state of the SFP state machine.

> 

> I suspect you are going to need a kernel thread to do the real

> work. So your "prepare" netlink op would pass the name of the firmware

> file. Some basic validation would be performed, that all the needed

> interfaces are down etc, and then the netlink OP would return. The

> thread then uses request_firmware() to get access to the firmware, and

> program it. Once complete, or on error, it can async notify user space

> that it is sorry, your module is toast, or firmware upgrade was

> successful.

> 

> This is just throwing out ideas...


Thanks Andrew and Jakub. I will look into these suggestions more closely
when I start working on modules firmware update.

Keller, Jacob E Aug. 10, 2021, 9:38 p.m. UTC | #13

> -----Original Message-----

> From: Jakub Kicinski <kuba@kernel.org>

> Sent: Tuesday, August 10, 2021 7:00 AM

> To: Andrew Lunn <andrew@lunn.ch>; Keller, Jacob E <jacob.e.keller@intel.com>

> Cc: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org;

> davem@davemloft.net; mkubecek@suse.cz; pali@kernel.org;

> vadimp@nvidia.com; mlxsw@nvidia.com; Ido Schimmel <idosch@nvidia.com>

> Subject: Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver

> modules' low power mode

> 

> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:

> > > The transition from low power to high power can take a few seconds with

> > > QSFP/QSFP-DD and it's likely to only get longer with future / more

> > > complex modules. Therefore, to reduce link-up time, the firmware

> > > automatically transitions modules to high power mode.

> > >

> > > There is obviously a trade-off here between power consumption and

> > > link-up time. My understanding is that Mellanox is not the only vendor

> > > favoring shorter link-up times as users have the ability to control the

> > > low power mode of the modules in other implementations.

> > >

> > > Regarding "why do we need user space involved?", by default, it does not

> > > need to be involved (the system works without this API), but if it wants

> > > to reduce the power consumption by setting unused modules to low power

> > > mode, then it will need to use this API.

> >

> > O.K. Thanks for the better explanation. Some of this should go into

> > the commit message.

> >

> > I suggest it gets a different name and semantics, to avoid

> > confusion. I think we should consider this the default power mode for

> > when the link is administratively down, rather than direct control

> > over the modules power mode. The driver should transition the module

> > to this setting on link down, be it high power or low power. That

> > saves a lot of complexity, since i assume you currently need a udev

> > script or something which sets it to low power mode on link down,

> > where as you can avoid this be configuring the default and let the

> > driver do it.

> 

> Good point. And actually NICs have similar knobs, exposed via ethtool

> priv flags today. Intel NICs for example. Maybe we should create a

> "really power the port down policy" API?

> 

> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

> or NC-SI related?



Offhand I don't know. I think we have some requirements documents I can look up. I'll try to get back to you soon if I can find any further information. (Yes, I wish the commit messages gave stronger motivation too...)

Thanks,
Jake

> 

> > I also wonder if a hierarchy is needed? You can set the default for

> > the switch, and then override is per module? I _guess_ most users will

> > decide at a switch level they want to save power and pay the penalty

> > over longer link up times. But then we have the question, is it an

> > ethtool option, or a devlink parameter?

Keller, Jacob E Aug. 10, 2021, 10 p.m. UTC | #14

On 8/10/2021 1:46 PM, Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:

>> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:

>>>> The transition from low power to high power can take a few seconds with

>>>> QSFP/QSFP-DD and it's likely to only get longer with future / more

>>>> complex modules. Therefore, to reduce link-up time, the firmware

>>>> automatically transitions modules to high power mode.

>>>>

>>>> There is obviously a trade-off here between power consumption and

>>>> link-up time. My understanding is that Mellanox is not the only vendor

>>>> favoring shorter link-up times as users have the ability to control the

>>>> low power mode of the modules in other implementations.

>>>>

>>>> Regarding "why do we need user space involved?", by default, it does not

>>>> need to be involved (the system works without this API), but if it wants

>>>> to reduce the power consumption by setting unused modules to low power

>>>> mode, then it will need to use this API.  

>>>

>>> O.K. Thanks for the better explanation. Some of this should go into

>>> the commit message.

>>>

>>> I suggest it gets a different name and semantics, to avoid

>>> confusion. I think we should consider this the default power mode for

>>> when the link is administratively down, rather than direct control

>>> over the modules power mode. The driver should transition the module

>>> to this setting on link down, be it high power or low power. That

>>> saves a lot of complexity, since i assume you currently need a udev

>>> script or something which sets it to low power mode on link down,

>>> where as you can avoid this be configuring the default and let the

>>> driver do it.

>>

>> Good point. And actually NICs have similar knobs, exposed via ethtool

>> priv flags today. Intel NICs for example. Maybe we should create a

>> "really power the port down policy" API?

> 

> See below about Intel. I'm not sure it's the same thing...

> 

> I'm against adding a vague "really power the port down policy" API. The

> API proposed in the patch is well-defined, its implementation is

> documented in standards, its implications are clear and we offer APIs

> that give user space full observability into its operation.

> 

> A vague API means that it is going to be abused and user space will get

> different results over different implementations. After reading the

> *commit messages* about the private flags, I'm not sure what the flags

> really do, what is their true motivation, implications or how do I get

> observability into their operation. I'm not too hopeful about the user

> documentation.

> 

> Also, like I mentioned in the cover letter, given the complexity of

> these modules and as they become more common, it is likely that we will

> need to extend the API to control more parameters and expose more

> diagnostic information. I would really like to keep it clean and

> contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over

> different APIs.

> 

>>

>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

>> or NC-SI related?

> 

> I went through all the Intel drivers that implement these operations and

> I believe you are talking about these commits:

> 

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

> 

> There isn't too much information about the motivation, but maybe it has

> something to do with multi-host controllers where you want to prevent

> one host from taking the physical link down for all the other hosts

> sharing it? I remember such issues with mlx5.

> 


Ok, I found some more information here. The primary motivation of the
changes in the i40e and ice drivers is from customer requests asking to
have the link go down when the port is administratively disabled. This
is because if the link is down then the switch on the other side will
see the port not having link and will stop trying to send traffic to it.

As far as I can tell, the reason its a flag is because some users wanted
the behavior the other way.

I'm not sure it's really related to the behavior here.

For what it's worth, I'm in favor of containing things like this into
ethtool as well.

Thanks,
Jake

Jakub Kicinski Aug. 10, 2021, 10:05 p.m. UTC | #15

On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:

> > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:  

> > > O.K. Thanks for the better explanation. Some of this should go into

> > > the commit message.

> > > 

> > > I suggest it gets a different name and semantics, to avoid

> > > confusion. I think we should consider this the default power mode for

> > > when the link is administratively down, rather than direct control

> > > over the modules power mode. The driver should transition the module

> > > to this setting on link down, be it high power or low power. That

> > > saves a lot of complexity, since i assume you currently need a udev

> > > script or something which sets it to low power mode on link down,

> > > where as you can avoid this be configuring the default and let the

> > > driver do it.  

> > 

> > Good point. And actually NICs have similar knobs, exposed via ethtool

> > priv flags today. Intel NICs for example. Maybe we should create a

> > "really power the port down policy" API?  

> 

> See below about Intel. I'm not sure it's the same thing...

> 

> I'm against adding a vague "really power the port down policy" API. The

> API proposed in the patch is well-defined, its implementation is

> documented in standards, its implications are clear and we offer APIs

> that give user space full observability into its operation.

> 

> A vague API means that it is going to be abused and user space will get

> different results over different implementations. After reading the

> *commit messages* about the private flags, I'm not sure what the flags

> really do, what is their true motivation, implications or how do I get

> observability into their operation. I'm not too hopeful about the user

> documentation.

> 

> Also, like I mentioned in the cover letter, given the complexity of

> these modules and as they become more common, it is likely that we will

> need to extend the API to control more parameters and expose more

> diagnostic information. I would really like to keep it clean and

> contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over

> different APIs.


The patch is well defined but it doesn't provide user with the answer
to the question "why is the SFP still up if I asked it to be down?"
It's good to match specs closely but Linux may need to reconcile
multiple policies.

IIUC if Intel decides to keep the SFP up for "other" reasons the
situation will look like this:

 $ ethtool --show-module eth0
 Module parameters for eth0:
 low-power true

 # ethtool -m eth0
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off


IOW the low-power mode is a way for user to express preference to
shut down/keep up the SFP, but it's not necessarily going to be
the only "policy" that matters. If other policies (e.g. NC-SI) express
preference to keep the interface up it will stay up, right?

The LowPwrRequestSW is not directly controlled by low-power && IF_UP.

What I had in mind was something along the lines of a bitmap of reasons
which are allowed to keep the interface up, and for your use case
the reason would be something like SFP_ALWAYS_ON, but other reasons
(well defined) may also keep the ifc up.

That's just to explain what I meant, if it's going to be clear to
everyone that low-power != LowPwrRequestSW I'm fine either way.

Jakub Kicinski Aug. 10, 2021, 10:06 p.m. UTC | #16

On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
> >> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

> >> or NC-SI related?  

> > 

> > I went through all the Intel drivers that implement these operations and

> > I believe you are talking about these commits:

> > 

> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070

> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408

> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

> > 

> > There isn't too much information about the motivation, but maybe it has

> > something to do with multi-host controllers where you want to prevent

> > one host from taking the physical link down for all the other hosts

> > sharing it? I remember such issues with mlx5.

> >   

> 

> Ok, I found some more information here. The primary motivation of the

> changes in the i40e and ice drivers is from customer requests asking to

> have the link go down when the port is administratively disabled. This

> is because if the link is down then the switch on the other side will

> see the port not having link and will stop trying to send traffic to it.

> 

> As far as I can tell, the reason its a flag is because some users wanted

> the behavior the other way.

> 

> I'm not sure it's really related to the behavior here.

> 

> For what it's worth, I'm in favor of containing things like this into

> ethtool as well.


I think the question was the inverse - why not always shut down the
port if the interface is brought down?

Keller, Jacob E Aug. 10, 2021, 10:18 p.m. UTC | #17

On 8/10/2021 3:06 PM, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:

>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

>>>> or NC-SI related?  

>>>

>>> I went through all the Intel drivers that implement these operations and

>>> I believe you are talking about these commits:

>>>

>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070

>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408

>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

>>>

>>> There isn't too much information about the motivation, but maybe it has

>>> something to do with multi-host controllers where you want to prevent

>>> one host from taking the physical link down for all the other hosts

>>> sharing it? I remember such issues with mlx5.

>>>   

>>

>> Ok, I found some more information here. The primary motivation of the

>> changes in the i40e and ice drivers is from customer requests asking to

>> have the link go down when the port is administratively disabled. This

>> is because if the link is down then the switch on the other side will

>> see the port not having link and will stop trying to send traffic to it.

>>

>> As far as I can tell, the reason its a flag is because some users wanted

>> the behavior the other way.

>>

>> I'm not sure it's really related to the behavior here.

>>

>> For what it's worth, I'm in favor of containing things like this into

>> ethtool as well.

> 

> I think the question was the inverse - why not always shut down the

> port if the interface is brought down?

> 


That... is a better question yes. Unfortunately so far I haven't found
any argument for not doing this. Only a bit about many requests to have
this behavior. It might just be inertia to maintain current behavior by
default...

Keller, Jacob E Aug. 10, 2021, 10:24 p.m. UTC | #18

On 8/10/2021 3:06 PM, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:

>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

>>>> or NC-SI related?  

>>>

>>> I went through all the Intel drivers that implement these operations and

>>> I believe you are talking about these commits:

>>>

>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070

>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408

>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

>>>

>>> There isn't too much information about the motivation, but maybe it has

>>> something to do with multi-host controllers where you want to prevent

>>> one host from taking the physical link down for all the other hosts

>>> sharing it? I remember such issues with mlx5.

>>>   

>>

>> Ok, I found some more information here. The primary motivation of the

>> changes in the i40e and ice drivers is from customer requests asking to

>> have the link go down when the port is administratively disabled. This

>> is because if the link is down then the switch on the other side will

>> see the port not having link and will stop trying to send traffic to it.

>>

>> As far as I can tell, the reason its a flag is because some users wanted

>> the behavior the other way.

>>

>> I'm not sure it's really related to the behavior here.

>>

>> For what it's worth, I'm in favor of containing things like this into

>> ethtool as well.

> 

> I think the question was the inverse - why not always shut down the

> port if the interface is brought down?

> 


So far the best I've found after digging is that forcing total shutdown
causes manageability and VF traffic to stop. Other customers want this
traffic to continue even when the PF port is brought down.

Thanks,
Jake

Andrew Lunn Aug. 10, 2021, 10:31 p.m. UTC | #19

On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:

> > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

> > >> or NC-SI related?  

> > > 

> > > I went through all the Intel drivers that implement these operations and

> > > I believe you are talking about these commits:

> > > 

> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070

> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408

> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

> > > 

> > > There isn't too much information about the motivation, but maybe it has

> > > something to do with multi-host controllers where you want to prevent

> > > one host from taking the physical link down for all the other hosts

> > > sharing it? I remember such issues with mlx5.

> > >   

> > 

> > Ok, I found some more information here. The primary motivation of the

> > changes in the i40e and ice drivers is from customer requests asking to

> > have the link go down when the port is administratively disabled. This

> > is because if the link is down then the switch on the other side will

> > see the port not having link and will stop trying to send traffic to it.

> > 

> > As far as I can tell, the reason its a flag is because some users wanted

> > the behavior the other way.

> > 

> > I'm not sure it's really related to the behavior here.

> 

> I think the question was the inverse - why not always shut down the

> port if the interface is brought down?


Humm. Something does not seem right here. I would assume that when you
administratively configure the link down, the SERDES in the MAC would
stop sending anything. So the module has nothing to send. The link
peer SERDES then looses sync, and reports that upwards as carrier
lost.

Does the i40e and ice leave its SERDES running when the link is
configured down? Or is the switch FUBAR and does not consider SERDES
loss of sync as carrier down?

Ido's use case does seem to be different. The link is down. Do we want
to leave the module active, probably sending a bit stream of all 0,
maybe noise, or do we want to power the module down, so it does not
send anything at all.

     Andrew

Andrew Lunn Aug. 10, 2021, 10:51 p.m. UTC | #20

> The patch is well defined but it doesn't provide user with the answer

> to the question "why is the SFP still up if I asked it to be down?"


Can the user even see that the SFP is still up? Does ip link show
give:

3: eth42: <BROADCAST,MULTICAST,LOWER_UP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

i.e. LOWER_UP despite DOWN?

Roopa added:

    rtnetlink: add support for protodown reason

Maybe we need the opposite as well?

      Andrew

Keller, Jacob E Aug. 11, 2021, 12:38 a.m. UTC | #21

On 8/10/2021 3:31 PM, Andrew Lunn wrote:
> On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote:

>> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:

>>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,

>>>>> or NC-SI related?  

>>>>

>>>> I went through all the Intel drivers that implement these operations and

>>>> I believe you are talking about these commits:

>>>>

>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070

>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408

>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

>>>>

>>>> There isn't too much information about the motivation, but maybe it has

>>>> something to do with multi-host controllers where you want to prevent

>>>> one host from taking the physical link down for all the other hosts

>>>> sharing it? I remember such issues with mlx5.

>>>>   

>>>

>>> Ok, I found some more information here. The primary motivation of the

>>> changes in the i40e and ice drivers is from customer requests asking to

>>> have the link go down when the port is administratively disabled. This

>>> is because if the link is down then the switch on the other side will

>>> see the port not having link and will stop trying to send traffic to it.

>>>

>>> As far as I can tell, the reason its a flag is because some users wanted

>>> the behavior the other way.

>>>

>>> I'm not sure it's really related to the behavior here.

>>

>> I think the question was the inverse - why not always shut down the

>> port if the interface is brought down?

> 

> Humm. Something does not seem right here. I would assume that when you

> administratively configure the link down, the SERDES in the MAC would

> stop sending anything. So the module has nothing to send. The link

> peer SERDES then looses sync, and reports that upwards as carrier

> lost.

> 


Right....

> Does the i40e and ice leave its SERDES running when the link is

> configured down? Or is the switch FUBAR and does not consider SERDES

> loss of sync as carrier down?

>



It's not clear to me. I tried to test with the driver, and it looks like
upstream doesn't yet have the link-down-on-close merged into net-next,
so I grabbed our out-of-tree driver.

Interestingly, both ip link show and ethtool do not report link as up
when the device is closed (ip link set enp175f0s0 down).....

So... whatever difference link-down-on-close makes we're definitely not
reporting things up.


I don't have a setup to confirm anything else right now unfortunately,
but I suspect something is wrong with the implementation of
link-down-on-close (at the very least it seems like we should still be
reporting LOWER_UP.... no?)

Unless there's some other weirdness like with QSFP or other
multi-port-single-cable setups?

I even tried adding some VFs and I see that regardless of whether
link-down-on-close is set, I can see link up on the VF....

Hmmmmm.

> Ido's use case does seem to be different. The link is down. Do we want

> to leave the module active, probably sending a bit stream of all 0,

> maybe noise, or do we want to power the module down, so it does not

> send anything at all.

> 

>      Andrew

> 


Right, I think these two cases are very different.

Ido Schimmel Aug. 11, 2021, 11:33 a.m. UTC | #22

On Tue, Aug 10, 2021 at 03:05:44PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote:

> > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:

> > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:  

> > > > O.K. Thanks for the better explanation. Some of this should go into

> > > > the commit message.

> > > > 

> > > > I suggest it gets a different name and semantics, to avoid

> > > > confusion. I think we should consider this the default power mode for

> > > > when the link is administratively down, rather than direct control

> > > > over the modules power mode. The driver should transition the module

> > > > to this setting on link down, be it high power or low power. That

> > > > saves a lot of complexity, since i assume you currently need a udev

> > > > script or something which sets it to low power mode on link down,

> > > > where as you can avoid this be configuring the default and let the

> > > > driver do it.  

> > > 

> > > Good point. And actually NICs have similar knobs, exposed via ethtool

> > > priv flags today. Intel NICs for example. Maybe we should create a

> > > "really power the port down policy" API?  

> > 

> > See below about Intel. I'm not sure it's the same thing...

> > 

> > I'm against adding a vague "really power the port down policy" API. The

> > API proposed in the patch is well-defined, its implementation is

> > documented in standards, its implications are clear and we offer APIs

> > that give user space full observability into its operation.

> > 

> > A vague API means that it is going to be abused and user space will get

> > different results over different implementations. After reading the

> > *commit messages* about the private flags, I'm not sure what the flags

> > really do, what is their true motivation, implications or how do I get

> > observability into their operation. I'm not too hopeful about the user

> > documentation.

> > 

> > Also, like I mentioned in the cover letter, given the complexity of

> > these modules and as they become more common, it is likely that we will

> > need to extend the API to control more parameters and expose more

> > diagnostic information. I would really like to keep it clean and

> > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over

> > different APIs.

> 

> The patch is well defined but it doesn't provide user with the answer

> to the question "why is the SFP still up if I asked it to be down?"

But you didn't ask the module to be "down", you asked the MAC. See more
below.

> It's good to match specs closely but Linux may need to reconcile

> multiple policies.

> 

> IIUC if Intel decides to keep the SFP up for "other" reasons the

> situation will look like this:

Intel did not decide to keep the module "up", it decided to keep both
the MAC and the module up. If one of them was down, the peer would
notice it, but it didn't (according to Jake's reply). This is a very
problematic behavior as you are telling your peer that everything is
fine, but in practice you are dropping all of its packets. I hit the
exact same issue with mlx5 a few years ago and when I investigated the
reason for this behavior I was referred to multi-host systems where you
don't want one host to take down the shared link for all the rest. See:

https://www.mellanox.com/sites/default/files/doc-2020/sb-externally-connected-multi-host.pdf

> 

>  $ ethtool --show-module eth0

>  Module parameters for eth0:

>  low-power true

> 

>  # ethtool -m eth0

>  Module State                              : 0x03 (ModuleReady)

>  LowPwrAllowRequestHW                      : Off

>  LowPwrRequestSW                           : Off

This output is wrong. In Intel's case "ip link show" will report the
link as down (according to Jake's reply) despite the MAC being up. On
the other hand, the output of "ethtool --show-module eth0" will show
that the module is not in low power mode and it will be correct.

> 

> 

> IOW the low-power mode is a way for user to express preference to

> shut down/keep up the SFP,

Yes, it controls the module, not the MAC. If you want to get a carrier,
both the module and the MAC need to be operational. See following
example where swp13 and swp14 are connected to each other:

$ ip link show dev swp13
127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

# ip link set dev swp13 down

# ethtool --set-module swp13 low-power on

$ ethtool --show-module swp13
Module parameters for swp13:
low-power true

# ip link set dev swp13 up

$ ip link show dev swp13
127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

# ip link set dev swp13 down

# ethtool --set-module swp13 low-power off

$ ethtool --show-module swp13
Module parameters for swp13:
low-power false

# ip link set dev swp13 up

$ ip link show dev swp13
127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

> but it's not necessarily going to be the only "policy" that matters.

> If other policies (e.g. NC-SI) express preference to keep the

> interface up it will stay up, right?

> 

> The LowPwrRequestSW is not directly controlled by low-power && IF_UP.

> 

> What I had in mind was something along the lines of a bitmap of reasons

> which are allowed to keep the interface up, and for your use case

> the reason would be something like SFP_ALWAYS_ON, but other reasons

> (well defined) may also keep the ifc up.

> 

> That's just to explain what I meant, if it's going to be clear to

> everyone that low-power != LowPwrRequestSW I'm fine either way.

I think we mixed two different use cases here. The first is a way to
make sure the link is fully operational (both MAC and module). In this
case, contrary to the expected behavior, "ip link set dev eth0 down"
will not take the MAC down and the peer will not lose its carrier. This
is most likely motivated by special hardware designs or exotic use cases
like I mentioned above.

The use case for this patch is completely different. Today, when you do
"ip link set dev eth0 up" you expect to gain a carrier which means that
both the MAC and the module are operational. The latter can be made
operational following the user request (e.g., SFP driver) or as soon as
it was plugged-in, by the device's firmware.

When you do "ip link set dev eth0 down" you expect the reverse to happen
and this is what happens today. In implementations where the module is
always operational, it stays in high power mode and continues to get
warm and consume unnecessary amount of power.

Some users might not be OK with that and would like more control, which
is exactly what this patch is doing. This patch does not change existing
behavior, the API has clear semantics and implications and user space
has full observability into its operation.

If in the future someone sees the need to add 'protoup', it can be done:

# ip link set dev eth0 up  # MAC and module are operational
# ip link set dev eth0 protoup on
# ip link set dev eth0 down # returns an error / ignore
# ethtool --set-module eth0 low-power on # returns an error because of admin up

You can obviously engineer situations that do not make any sense. Like
this:

# ethtool --set-module eth0 low-power on
# ip link set dev eth0 up
# ip link set dev eth0 protoup on

The MAC is operational and you can't take it down, but you will never
get a carrier because you explicitly asked the module to stay in low
power mode.

Jakub Kicinski Aug. 11, 2021, 1:03 p.m. UTC | #23

On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> # ethtool --set-module swp13 low-power on

> 

> $ ethtool --show-module swp13

> Module parameters for swp13:

> low-power true

> 

> # ip link set dev swp13 up

> 

> $ ip link show dev swp13

> 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000

>     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

> 

> $ ip link show dev swp14

> 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000

>     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff


Oh, so if we set low-power true the carrier will never show up?
I thought Andrew suggested the setting is only taken into account 
when netdev is down. That made so much sense to me I assumed we'll
just go with that. I must have misunderstood.

Andrew Lunn Aug. 11, 2021, 2:36 p.m. UTC | #24

On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:

> > # ethtool --set-module swp13 low-power on

> > 

> > $ ethtool --show-module swp13

> > Module parameters for swp13:

> > low-power true

> > 

> > # ip link set dev swp13 up

> > 

> > $ ip link show dev swp13

> > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000

> >     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

> > 

> > $ ip link show dev swp14

> > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000

> >     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

> 

> Oh, so if we set low-power true the carrier will never show up?

> I thought Andrew suggested the setting is only taken into account 

> when netdev is down.

Yes, that was my intention. If this low power mode also applies when
the interface is admin up, it sounds like a foot gun. ip link show
gives you no idea why the carrier is down, and people will assume the
cable or peer is broken. We at least need a new flag, LOWER_DISABLED
or similar to give the poor user some chance to figure out what is
going on.

To me, this setting should only apply when the link is admin down.

	Andrew

Ido Schimmel Aug. 11, 2021, 7:37 p.m. UTC | #25

On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote:
> On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:

> > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:

> > > # ethtool --set-module swp13 low-power on

> > > 

> > > $ ethtool --show-module swp13

> > > Module parameters for swp13:

> > > low-power true

> > > 

> > > # ip link set dev swp13 up

> > > 

> > > $ ip link show dev swp13

> > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000

> > >     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

> > > 

> > > $ ip link show dev swp14

> > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000

> > >     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

> > 

> > Oh, so if we set low-power true the carrier will never show up?

> > I thought Andrew suggested the setting is only taken into account 

> > when netdev is down.

> 

> Yes, that was my intention. If this low power mode also applies when

> the interface is admin up, it sounds like a foot gun. ip link show

> gives you no idea why the carrier is down, and people will assume the

> cable or peer is broken. We at least need a new flag, LOWER_DISABLED

> or similar to give the poor user some chance to figure out what is

> going on.

The canonical way to report such errors is via LINKSTATE_GET and I will
add an extended sub-state describing the problem.

> 

> To me, this setting should only apply when the link is admin down.

I don't want to bake such an assumption into the kernel, but I have a
suggestion that resolves the issue.

We currently have a single attribute that encodes the desired state on
SET messages and the operational state on GET_REPLY messages
(ETHTOOL_A_MODULE_LOW_POWER_ENABLED):

$ ethtool --show-module swp11
Module parameters for swp11:
low-power true

It is not defined very well when a module is not connected despite being
a very interesting use case. We really need to have two attributes. The
first one describing the power mode policy and the second one describing
the operational power mode which is only reported when a module is
plugged in.

For the policy we can have these values:

1. low: Always transition the module to low power mode
2. high: Always transition the module to high power mode
3. high-on-up: Transition the module to high power mode when a port
using it is administratively up. Otherwise, low

A different policy for up/down seems like an overkill for me.

See example usage below.

No module connected:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high

Like I mentioned before, this is the default on Mellanox systems so this
new attribute allows user space to query the default policy.

Change to a different policy:

# ethtool --set-module swp11 power-mode-policy high-on-up

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up

After a module was connected:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up
power-mode low

# ip link set dev swp11 up

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up
low-power high

# ip link set dev swp11 down

# ethtool --set-module swp11 power-mode-policy low

# ip link set dev swp11 up

$ ethtool swp11
...
Link detected: no (Cable issue, Module is in low power mode)

I'm quite happy with the above. Might change a few things as I implement
it, but you get the gist. WDYT?

Jakub Kicinski Aug. 11, 2021, 8:30 p.m. UTC | #26

On Wed, 11 Aug 2021 22:37:53 +0300 Ido Schimmel wrote:
> On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote:

> > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:  

> > > Oh, so if we set low-power true the carrier will never show up?

> > > I thought Andrew suggested the setting is only taken into account 

> > > when netdev is down.  

> > 

> > Yes, that was my intention. If this low power mode also applies when

> > the interface is admin up, it sounds like a foot gun. ip link show

> > gives you no idea why the carrier is down, and people will assume the

> > cable or peer is broken. We at least need a new flag, LOWER_DISABLED

> > or similar to give the poor user some chance to figure out what is

> > going on.  

> 

> The canonical way to report such errors is via LINKSTATE_GET and I will

> add an extended sub-state describing the problem.

>  

> > To me, this setting should only apply when the link is admin down.  

> 

> I don't want to bake such an assumption into the kernel, but I have a

> suggestion that resolves the issue.

> 

> We currently have a single attribute that encodes the desired state on

> SET messages and the operational state on GET_REPLY messages

> (ETHTOOL_A_MODULE_LOW_POWER_ENABLED):

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> low-power true

> 

> It is not defined very well when a module is not connected despite being

> a very interesting use case. We really need to have two attributes. The

> first one describing the power mode policy and the second one describing

> the operational power mode which is only reported when a module is

> plugged in.

> 

> For the policy we can have these values:

> 

> 1. low: Always transition the module to low power mode

> 2. high: Always transition the module to high power mode

> 3. high-on-up: Transition the module to high power mode when a port

> using it is administratively up. Otherwise, low

> 

> A different policy for up/down seems like an overkill for me.

> 

> See example usage below.

> 

> No module connected:

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high

> 

> Like I mentioned before, this is the default on Mellanox systems so this

> new attribute allows user space to query the default policy.

> 

> Change to a different policy:

> 

> # ethtool --set-module swp11 power-mode-policy high-on-up

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high-on-up

> 

> After a module was connected:

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high-on-up

> power-mode low

> 

> # ip link set dev swp11 up

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high-on-up

> low-power high

> 

> # ip link set dev swp11 down

> 

> # ethtool --set-module swp11 power-mode-policy low

> 

> # ip link set dev swp11 up

> 

> $ ethtool swp11

> ...

> Link detected: no (Cable issue, Module is in low power mode)

> 

> I'm quite happy with the above. Might change a few things as I implement

> it, but you get the gist. WDYT?


Isn't the "low-power" attr just duplicating the relevant bits from -m?

Andrew Lunn Aug. 11, 2021, 8:42 p.m. UTC | #27

> For the policy we can have these values:

> 

> 1. low: Always transition the module to low power mode

> 2. high: Always transition the module to high power mode

> 3. high-on-up: Transition the module to high power mode when a port

> using it is administratively up. Otherwise, low

> 

> A different policy for up/down seems like an overkill for me.


O.K. The current kernel SFP driver is going to default to high-on-up,
which is what kernel driven copper PHYs also do.

> After a module was connected:

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high-on-up

> power-mode low

> 

> # ip link set dev swp11 up

> 

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high-on-up

> low-power high

> 

> # ip link set dev swp11 down


You missed a show here. I expect it to be:

> $ ethtool --show-module swp11

> Module parameters for swp11:

> power-mode-policy high-on-up

> power-mode low


since it is now down.

I suppose we should consider the bigger picture. Is this feature
limited to just SFP modules, or does it make sense to any other sort
of network technology? CAN, bluetooth, 5G, IPoAC?

   Andrew

Andrew Lunn Aug. 11, 2021, 8:57 p.m. UTC | #28

> Isn't the "low-power" attr just duplicating the relevant bits from -m?


Do all SFPs report it in the dump? I'm thinking GPON, 1G modules with
a TX_ENABLE pin? INF-8074 does not specify a bit in the 'EEPROM' data
to indicate the status. So you need to know the state of the GPIO
driving the TX_ENABLE pin.

   Andrew

Ido Schimmel Aug. 11, 2021, 9:04 p.m. UTC | #29

On Wed, Aug 11, 2021 at 01:30:06PM -0700, Jakub Kicinski wrote:
> Isn't the "low-power" attr just duplicating the relevant bits from -m?


If low power is forced by hardware, then it depends on the assertion /
de-assertion of the hardware signal which is obviously not part of the
EEPROM dump.

I knew this would come up, so I mentioned it in the commit message:

"The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space."

[RFC,net-next,0/8] ethtool: Add ability to control transceiver modules

Message

Comments