[RFC,1/6] docs: networking: add the document for DFL Ether Group driver

Message ID 1603442745-13085-2-git-send-email-yilun.xu@intel.com
State New
Series Add the netdev support for Intel PAC N3000 FPGA

Commit Message

Xu Yilun Oct. 23, 2020, 8:45 a.m. UTC
This patch adds the document for DFL Ether Group driver.

Signed-off-by: Xu Yilun <yilun.xu@intel.com>
---
 .../networking/device_drivers/ethernet/index.rst   |   1 +
 .../ethernet/intel/dfl-eth-group.rst               | 102 +++++++++++++++++++++
 2 files changed, 103 insertions(+)
 create mode 100644 Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst

Comments

Andrew Lunn Oct. 23, 2020, 3:37 p.m. UTC | #1
Hi Xu

Before i look at the other patches, i want to understand the
architecture properly.

> +=======================================================================
> +DFL device driver for Ether Group private feature on Intel(R) PAC N3000
> +=======================================================================
> +
> +This is the driver for Ether Group private feature on Intel(R)
> +PAC (Programmable Acceleration Card) N3000.

I assume this is just one implementation. The FPGA could be placed on
other boards. So some of the limitations you talk about with the BMC
are artificial, and the overall architecture of the drivers is more
generic?

> +The Intel(R) PAC N3000 is a FPGA based SmartNIC platform for multi-workload
> +networking application acceleration. A simple diagram below to for the board:
> +
> +                     +----------------------------------------+
> +                     |                  FPGA                  |
> ++----+   +-------+   +-----------+  +----------+  +-----------+   +----------+
> +|QSFP|---|retimer|---|Line Side  |--|User logic|--|Host Side  |---|XL710     |
> ++----+   +-------+   |Ether Group|  |          |  |Ether Group|   |Ethernet  |
> +                     |(PHY + MAC)|  |wiring &  |  |(MAC + PHY)|   |Controller|
> +                     +-----------+  |offloading|  +-----------+   +----------+
> +                     |              +----------+              |
> +                     |                                        |
> +                     +----------------------------------------+

Is XL710 required? I assume any MAC with the correct MII interface
will work?

Do you really mean PHY? I actually expect it is PCS? 

> +The DFL Ether Group driver registers netdev for each line side link. Users
> +could use standard commands (ethtool, ip, ifconfig) for configuration and
> +link state/statistics reading. For host side links, they are always connected
> +to the host ethernet controller, so they should always have same features as
> +the host ethernet controller. There is no need to register netdevs for them.

So lets say the XL710 is eth0. The line side netif is eth1. Where do i
put the IP address? What interface do i add to quagga OSPF? 

> +The driver just enables these links on probe.
> +
> +The retimer chips are managed by onboard BMC (Board Management Controller)
> +firmware, host driver is not capable to access them directly.

What about the QSFP socket? Can the host get access to the I2C bus?
The pins for TX enable, etc. ethtool -m?

> +Speed/Duplex
> +------------
> +The Ether Group doesn't support auto-negotiation. The link speed is fixed to
> +10G, 25G or 40G full duplex according to which Ether Group IP is programmed.

So that means, if i pop out the SFP and put in a different one which
supports a different speed, it is expected to be broken until the FPGA
is reloaded?

     Andrew
Tom Rix Oct. 24, 2020, 2:25 p.m. UTC | #2
On 10/23/20 1:45 AM, Xu Yilun wrote:
> This patch adds the document for DFL Ether Group driver.

>

> Signed-off-by: Xu Yilun <yilun.xu@intel.com>

> ---

>  .../networking/device_drivers/ethernet/index.rst   |   1 +

>  .../ethernet/intel/dfl-eth-group.rst               | 102 +++++++++++++++++++++

>  2 files changed, 103 insertions(+)

>  create mode 100644 Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst

>

> diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst

> index cbb75a18..eb7c443 100644

> --- a/Documentation/networking/device_drivers/ethernet/index.rst

> +++ b/Documentation/networking/device_drivers/ethernet/index.rst

> @@ -26,6 +26,7 @@ Contents:

>     freescale/gianfar

>     google/gve

>     huawei/hinic

> +   intel/dfl-eth-group

>     intel/e100

>     intel/e1000

>     intel/e1000e

> diff --git a/Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst b/Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst

> new file mode 100644

> index 0000000..525807e

> --- /dev/null

> +++ b/Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst

> @@ -0,0 +1,102 @@

> +.. SPDX-License-Identifier: GPL-2.0+

> +

> +=======================================================================

> +DFL device driver for Ether Group private feature on Intel(R) PAC N3000

> +=======================================================================

> +

> +This is the driver for Ether Group private feature on Intel(R)

> +PAC (Programmable Acceleration Card) N3000.

> +

> +The Intel(R) PAC N3000 is a FPGA based SmartNIC platform for multi-workload
> +networking application acceleration. A simple diagram below to for the board:
> +
> +                     +----------------------------------------+
> +                     |                  FPGA                  |
> ++----+   +-------+   +-----------+  +----------+  +-----------+   +----------+
> +|QSFP|---|retimer|---|Line Side  |--|User logic|--|Host Side  |---|XL710     |
> ++----+   +-------+   |Ether Group|  |          |  |Ether Group|   |Ethernet  |
> +                     |(PHY + MAC)|  |wiring &  |  |(MAC + PHY)|   |Controller|
> +                     +-----------+  |offloading|  +-----------+   +----------+
> +                     |              +----------+              |
> +                     |                                        |
> +                     +----------------------------------------+

> +

> +The FPGA is composed of FPGA Interface Module (FIM) and Accelerated Function

> +Unit (AFU). The FIM implements the basic functionalities for FPGA access,

> +management and reprograming, while the AFU is the FPGA reprogramable region for

> +users.

> +

> +The Line Side & Host Side Ether Groups are soft IP blocks embedded in FIM. They

The Line Side and Host Side Ether Groups are soft IP blocks embedded in the FIM.
> +are internally wire connected to AFU and communicate with AFU with MAC packets.

are internally connected to the AFU and communicate with the AFU using MAC packets
> +The user logic is developed by the FPGA users and re-programmed to AFU,

The user logic is application dependent, supplied by the FPGA developer and used to reprogram the AFU.
> +providing the user defined wire connections between line side & host side data

between Line Side and Host Side
> +interfaces, as well as the MAC layer offloading.

> +

> +There are 2 types of interfaces for the Ether Groups:

> +

> +1. The data interfaces connects the Ether Groups and the AFU, host has no

The data interface which connects
> +ability to control the data stream . So the FPGA is like a pipe between the

> +host ethernet controller and the retimer chip.

> +

> +2. The management interfaces connects the Ether Groups to the host, so host

The management interface which connects
> +could access the Ether Group registers for configuration and statistics

> +reading.

> +

> +The Intel(R) PAC N3000 could be programmed to various configurations (with

N3000 can be
> +different link numbers and speeds, e.g. 8x10G, 4x25G ...). It is done by

This is done
> +programing different variants of the Ether Group IP blocks, and doing

> +corresponding configuration to the retimer chips.

programming different variants of the Ether Group IP blocks and retimer configuration.
> +

> +The DFL Ether Group driver registers netdev for each line side link. Users

registers a netdev
> +could use standard commands (ethtool, ip, ifconfig) for configuration and

> +link state/statistics reading. For host side links, they are always connected

> +to the host ethernet controller, so they should always have same features as

> +the host ethernet controller. There is no need to register netdevs for them.

> +The driver just enables these links on probe.

> +

> +The retimer chips are managed by onboard BMC (Board Management Controller)

> +firmware, host driver is not capable to access them directly. So it is mostly


firmware. The host driver

So it behaves like

> +like an external fixed PHY. However the link states detected by the retimer

> +chips can not be propagated to the Ether Groups for hardware limitation, in

Limitations should get their own section; this is going off on a tangent.
> +order to manage the link state, a PHY driver (intel-m10-bmc-retimer) is

> +introduced to query the BMC for the retimer's link state. The Ether Group

> +driver would connect to the PHY devices and get the link states. The

> +intel-m10-bmc-retimer driver creates a peseudo MDIO bus for each board, so

> +that the Ether Group driver could find the PHY devices by their peseudo PHY

> +addresses.

> +

> +

> +2. Features supported

> +=====================

> +

> +Data Path

> +---------

> +Since the driver can't control the data stream, the Ether Group driver

> +doesn't implement the valid tx/rx functions. Any transmit attempt on these

> +links from host will be dropped, and no data could be received to host from

links from the host will be dropped.  (you can assume a dropped link will not have data and shorten the sentence)
> +these links. Users should operate on the netdev of host ethernet controller

> +for networking data traffic.

> +

> +

> +Speed/Duplex

> +------------

> +The Ether Group doesn't support auto-negotiation. The link speed is fixed to

does not
> +10G, 25G or 40G full duplex according to which Ether Group IP is programmed.

> +

> +Statistics

> +----------

> +The Ether Group IP has the statistics counters for ethernet traffic and errors.

> +The user can obtain these MAC-level statistics using "ethtool -S" option.

> +

> +MTU

> +---

> +The Ether Group IP is capable of detecting oversized packets. It will not drop

> +the packet but pass it up and increment the tx/rx oversize counters. The MTU

but will pass it and
> +could be changed via ip or ifconfig commands.

> +

> +Flow Control

> +------------

> +Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable

> +transmitting pause frames. Receiving pause request from outside to Ether Group


pausing tx frames. Receiving a pause

Tom

> +MAC is not supported. The flow control auto-negotiation is not supported. The

> +user can enable or disable Tx Flow Control using "ethtool -A eth? tx <on|off>"
Xu Yilun Oct. 26, 2020, 8:52 a.m. UTC | #3
Hi Andrew

Thanks for your fast response, see comments inline.

On Fri, Oct 23, 2020 at 05:37:31PM +0200, Andrew Lunn wrote:
> Hi Xu
> 
> Before i look at the other patches, i want to understand the
> architecture properly.

I have a doc to describe the architecture:

https://www.intel.com/content/www/us/en/programmable/documentation/xgz1560360700260.html

The "Figure 1" is a more detailed figure for the arch. It should be
helpful.

> 
> > +=======================================================================
> > +DFL device driver for Ether Group private feature on Intel(R) PAC N3000
> > +=======================================================================
> > +
> > +This is the driver for Ether Group private feature on Intel(R)
> > +PAC (Programmable Acceleration Card) N3000.
> 
> I assume this is just one implementation. The FPGA could be placed on
> other boards. So some of the limitations you talk about with the BMC
> artificial, and the overall architecture of the drivers is more
> generic?

I could see if the retimer management is changed, e.g. access the retimer
through a host controlled MDIO, maybe I need a more generic way to find the
MDIO bus.

Do you have other suggestions?

> 
> > +The Intel(R) PAC N3000 is a FPGA based SmartNIC platform for multi-workload
> > +networking application acceleration. A simple diagram below to for the board:
> > +
> > +                     +----------------------------------------+
> > +                     |                  FPGA                  |
> > ++----+   +-------+   +-----------+  +----------+  +-----------+   +----------+
> > +|QSFP|---|retimer|---|Line Side  |--|User logic|--|Host Side  |---|XL710     |
> > ++----+   +-------+   |Ether Group|  |          |  |Ether Group|   |Ethernet  |
> > +                     |(PHY + MAC)|  |wiring &  |  |(MAC + PHY)|   |Controller|
> > +                     +-----------+  |offloading|  +-----------+   +----------+
> > +                     |              +----------+              |
> > +                     |                                        |
> > +                     +----------------------------------------+
> 
> Is XL710 required? I assume any MAC with the correct MII interface
> will work?

The XL710 is required for this implementation, in which we have the Host
Side Ether Group facing the host.  The Host Side Ether Group actually
contains the same IP blocks as Line Side. It contains the compacted MAC &
PHY functionalities for 25G/40G case. The 25G MAC-PHY soft IP SPEC can
be found at:

https://www.intel.com/content/www/us/en/programmable/documentation/ewo1447742896786.html

So raw serial data is output from Host Side FPGA, and XL710 is good to
handle this.

> 
> Do you really mean PHY? I actually expect it is PCS? 

For this implementation, yes.

I guess if you program another IP block on FPGA host side, e.g. a PCS interface,
and replace XL710 with another MAC, it may also work. But I think there should
be other drivers to handle this.

I may check with our hardware designers whether there is a concern that we
don't use MII for the connection between the FPGA & host.

The FPGA User is mainly concerned about the user logic part. The Ether
Groups in FIU and Board components are not expected to be re-designed by
the user. So I think I should still focus on the driver for this
implementation.

> 
> > +The DFL Ether Group driver registers netdev for each line side link. Users
> > +could use standard commands (ethtool, ip, ifconfig) for configuration and
> > +link state/statistics reading. For host side links, they are always connected
> > +to the host ethernet controller, so they should always have same features as
> > +the host ethernet controller. There is no need to register netdevs for them.
> 
> So lets say the XL710 is eth0. The line side netif is eth1. Where do i
> put the IP address? What interface do i add to quagga OSPF? 

The IP address should be put in eth0. eth0 should always be used for the
tools.

The line/host side Ether Group is not the endpoint of the network data stream.
eth1 will not participate in the network data exchange with the host.

The main purposes of eth1 are:
1. To let users monitor the network statistics on the Line Side. By comparing the
statistics between eth0 & eth1, users can get some idea of how the user
logic is functioning.

2. To get the link state of the front panel. The XL710 is connected to the
Host Side of the FPGA, so its link state is always up. To
check the link state of the front panel, we need to query eth1.
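
The statistics comparison in purpose 1 could be scripted roughly as below. This is only a sketch: it parses text in the "name: value" layout that `ethtool -S` prints, and the counter names and values are made up for the example. The real counter names exposed by the XL710 and Ether Group drivers are not given in this thread and may not match one-to-one.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines in the layout printed by `ethtool -S <iface>`."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # no colon on this line
        value = value.strip()
        if value.isdigit():  # skips the "NIC statistics:" header line
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(host_stats, line_stats):
    """Return {name: line - host} for counters present on both sides that differ."""
    return {
        name: line_stats[name] - host_stats[name]
        for name in sorted(host_stats.keys() & line_stats.keys())
        if line_stats[name] != host_stats[name]
    }


# Hypothetical sample dumps; counter names are assumptions for illustration.
ETH0 = """NIC statistics:
     rx_packets: 1000
     tx_packets: 900
"""
ETH1 = """NIC statistics:
     rx_packets: 1200
     tx_packets: 900
"""

deltas = counter_deltas(parse_ethtool_stats(ETH0), parse_ethtool_stats(ETH1))
print(deltas)
```

A non-zero delta here would only hint that the user logic dropped, generated or consumed frames between the two sides; it says nothing about why.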

> 
> > +The driver just enables these links on probe.
> > +
> > +The retimer chips are managed by onboard BMC (Board Management Controller)
> > +firmware, host driver is not capable to access them directly.
> 
> What about the QSPF socket? Can the host get access to the I2C bus?
> The pins for TX enable, etc. ethtool -m?

No, the QSFP/I2C are also managed by the BMC firmware, and the host doesn't
have an interface to talk to the BMC firmware about the QSFP.

> 
> > +Speed/Duplex
> > +------------
> > +The Ether Group doesn't support auto-negotiation. The link speed is fixed to
> > +10G, 25G or 40G full duplex according to which Ether Group IP is programmed.
> 
> So that means, if i pop out the SFP and put in a different one which
> supports a different speed, it is expected to be broken until the FPGA
> is reloaded?

It is expected to be broken.

Currently the line side is expected to be configured as 4x10G, 4x25G, 2x25G or 1x25G.
The host side is expected to be 4x10G or 2x40G for the XL710.

So a 4-channel SFP is expected to be inserted in the front panel, and we should use
a 4x25G SFP, which is compatible with a 4x10G connection.
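
The configurations listed above can be sanity-checked with a small helper. This is only a sketch: the "NxSG" string format is taken from the text, but the function names and the compatibility rule (a module works if it has at least as many lanes and at least the per-lane speed of the configuration) are assumptions inferred from the 4x25G/4x10G remark, not a statement about the hardware.

```python
import re


def parse_link_config(config):
    """Split a string such as '4x25G' into (lanes, gbps_per_lane)."""
    match = re.fullmatch(r"(\d+)x(\d+)G", config)
    if match is None:
        raise ValueError(f"unrecognised link configuration: {config!r}")
    return int(match.group(1)), int(match.group(2))


def aggregate_gbps(config):
    """Total bandwidth of a configuration in Gb/s."""
    lanes, speed = parse_link_config(config)
    return lanes * speed


def module_supports(module, config):
    """Assumed rule: enough lanes and enough per-lane speed."""
    m_lanes, m_speed = parse_link_config(module)
    c_lanes, c_speed = parse_link_config(config)
    return m_lanes >= c_lanes and m_speed >= c_speed


LINE_SIDE = ["4x10G", "4x25G", "2x25G", "1x25G"]
for cfg in LINE_SIDE:
    print(cfg, aggregate_gbps(cfg), module_supports("4x25G", cfg))
```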

Thanks,
Yilun

> 
>      Andrew
Andrew Lunn Oct. 26, 2020, 1 p.m. UTC | #4
> > > +The Intel(R) PAC N3000 is a FPGA based SmartNIC platform for multi-workload
> > > +networking application acceleration. A simple diagram below to for the board:
> > > +
> > > +                     +----------------------------------------+
> > > +                     |                  FPGA                  |
> > > ++----+   +-------+   +-----------+  +----------+  +-----------+   +----------+
> > > +|QSFP|---|retimer|---|Line Side  |--|User logic|--|Host Side  |---|XL710     |
> > > ++----+   +-------+   |Ether Group|  |          |  |Ether Group|   |Ethernet  |
> > > +                     |(PHY + MAC)|  |wiring &  |  |(MAC + PHY)|   |Controller|
> > > +                     +-----------+  |offloading|  +-----------+   +----------+
> > > +                     |              +----------+              |
> > > +                     |                                        |
> > > +                     +----------------------------------------+

> > 

> > Is XL710 required? I assume any MAC with the correct MII interface

> > will work?

> 

> The XL710 is required for this implementation, in which we have the Host

> Side Ether Group facing the host.  The Host Side Ether Group actually

> contains the same IP blocks as Line Side. It contains the compacted MAC &

> PHY functionalities for 25G/40G case. The 25G MAC-PHY soft IP SPEC can

> be found at:

> 

> https://www.intel.com/content/www/us/en/programmable/documentation/ewo1447742896786.html

> 

> So raw serial data is output from Host Side FPGA, and XL710 is good to

> handle this.


What i have seen working with Marvell Ethernet switches, is that
Marvell normally recommends connecting them to the Ethernet interfaces
of Marvell SoCs. But the switch just needs a compatible MII interface,
and lots of boards make use of non-Marvell MAC chips. Freescale FEC is
very popular.

What i'm trying to say is that ideally we need a collection of generic
drivers for the different major components on the board, and a board
driver which glues it all together. That then allows somebody to build
other boards, or integrate the FPGA directly into an embedded system
directly connected to a SoC, etc.

> > Do you really mean PHY? I actually expect it is PCS? 

> 

> For this implementation, yes.


Yes, you have a PHY? Or Yes, it is PCS?

To me, the phylib maintainer, having a PHY means you have a base-T
interface, 25Gbase-T, 40Gbase-T?  That would be an odd and expensive
architecture when you should be able to just connect SERDES interfaces
together.

> > > +The DFL Ether Group driver registers netdev for each line side link. Users

> > > +could use standard commands (ethtool, ip, ifconfig) for configuration and

> > > +link state/statistics reading. For host side links, they are always connected

> > > +to the host ethernet controller, so they should always have same features as

> > > +the host ethernet controller. There is no need to register netdevs for them.

> > 

> > So lets say the XL710 is eth0. The line side netif is eth1. Where do i

> > put the IP address? What interface do i add to quagga OSPF? 

> 

> The IP address should be put in eth0. eth0 should always be used for the

> tools.


That was what i was afraid of :-)

> 

> The line/host side Ether Group is not the terminal of the network data stream.

> Eth1 will not paticipate in the network data exchange to host.

> 

> The main purposes for eth1 are:

> 1. For users to monitor the network statistics on Line Side, and by comparing the

> statistics between eth0 & eth1, users could get some knowledge of how the User

> logic is taking function.

> 

> 2. Get the link state of the front panel. The XL710 is now connected to

> Host Side of the FPGA and the its link state would be always on. So to

> check the link state of the front panel, we need to query eth1.


This is very non-intuitive. We try to avoid this in the kernel and the
API to userspace. Ethernet switches are always modelled as
accelerators for what the Linux network stack can already do. You
configure an Ethernet switch port in just the same way you configure any
other netdev. You add an IP address to the switch port, you get the
Ethernet statistics from the switch port, routing protocols use the
switch port.

Your design needs to be the same. All configuration needs to happen via
eth1.

Please look at the DSA architecture. What you have here is very
similar to a two port DSA switch. In DSA terminology, we would call
eth0 the master interface.  It needs to be up, but otherwise the user
does not configure it. eth1 is the slave interface. It is the user
facing interface of the switch. All configuration happens on this
interface. Linux can also send/receive packets on this netdev. The
slave TX function forwards the frame to the master interface netdev,
via a DSA tagger. Frames which eth0 receives are passed through the
tagger and then passed to the slave interface.

All the infrastructure you need is already in place. Please use
it. I'm not saying you need to write a DSA driver, but you should make
use of the same ideas and low level hooks in the network stack which
DSA uses.
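
To make the master/slave flow described above concrete, here is a toy user-space model. It is not kernel code and does not use the real DSA API: the 2-byte tag layout, `TAG_MAGIC`, the class names and the `wire` list are all invented for illustration.

```python
import struct

TAG_MAGIC = 0xD5  # hypothetical tag marker, not a real DSA tag format


def tag_frame(port, payload):
    """Tagger transmit path: prepend a 2-byte tag naming the switch port."""
    return struct.pack("!BB", TAG_MAGIC, port) + payload


def untag_frame(frame):
    """Tagger receive path: return (port, payload), or None if untagged."""
    if len(frame) < 2 or frame[0] != TAG_MAGIC:
        return None
    return frame[1], frame[2:]


class Master:
    """Stands in for eth0: carries tagged frames, not configured by the user."""

    def __init__(self):
        self.wire = []    # frames handed to the hardware
        self.slaves = {}  # port number -> Slave

    def xmit(self, frame):
        self.wire.append(frame)

    def receive(self, frame):
        parsed = untag_frame(frame)
        if parsed is None:
            return False
        port, payload = parsed
        slave = self.slaves.get(port)
        if slave is None:
            return False
        slave.rx_queue.append(payload)  # deliver to the user-facing netdev
        return True


class Slave:
    """Stands in for eth1: the user-facing interface of the switch port."""

    def __init__(self, master, port):
        self.master, self.port, self.rx_queue = master, port, []
        master.slaves[port] = self

    def xmit(self, payload):
        # The slave TX function forwards the tagged frame to the master.
        self.master.xmit(tag_frame(self.port, payload))


eth0 = Master()
eth1 = Slave(eth0, port=1)
eth1.xmit(b"ping")
eth0.receive(tag_frame(1, b"pong"))
print(eth0.wire, eth1.rx_queue)
```

The point of the model is only the direction of flow: the user talks to eth1, and eth0 is plumbing underneath it.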

> > What about the QSPF socket? Can the host get access to the I2C bus?

> > The pins for TX enable, etc. ethtool -m?

> 

> No, the QSPF/I2C are also managed by the BMC firmware, and host doesn't

> have interface to talk to BMC firmware about QSPF.


So can i even tell what SFP is in the socket? 

> > > +Speed/Duplex

> > > +------------

> > > +The Ether Group doesn't support auto-negotiation. The link speed is fixed to

> > > +10G, 25G or 40G full duplex according to which Ether Group IP is programmed.

> > 

> > So that means, if i pop out the SFP and put in a different one which

> > supports a different speed, it is expected to be broken until the FPGA

> > is reloaded?

> 

> It is expected to be broken.


And since i have no access to the SFP information, i have no idea what
is actually broken? How i should configure the various layers?

> Now the line side is expected to be configured to 4x10G, 4x25G, 2x25G, 1x25G.

> host side is expected to be 4x10G or 2x40G for XL710.

> 

> So 4 channel SFP is expected to be inserted to front panel. And we should use

> 4x25G SFP, which is compatible to 4x10G connection.


So if you had exported the SFP to linux, phylink could have handled some
of this for you. Probably with some extensions to phylink, but Russell
King would probably have helped you. phylink has a good idea how to
decode the SFP EEPROM and figure out the link mode. It has interfaces
to configure PCS blocks, so it could probably deal with the line side
and host side PCS. And it would have been easy to send a udev
notification that the SFP has changed, maybe user space needs to
download a different FPGA bit file? So the user would not see a broken
interface, the hardware could be reconfigured on the fly.

This is one problem i have with this driver. It is based around this
somewhat broken reference design. phylib, along with the hacks you
have, are enough for this reference design. But really you want to
make use of phylink in order to support less limited designs which
will follow. Or you need to push a lot more into the BMC, and don't
use phylib at all.

    Andrew
Xu Yilun Oct. 26, 2020, 5:38 p.m. UTC | #5
On Mon, Oct 26, 2020 at 02:00:01PM +0100, Andrew Lunn wrote:
> > > > +The Intel(R) PAC N3000 is a FPGA based SmartNIC platform for multi-workload
> > > > +networking application acceleration. A simple diagram below to for the board:
> > > > +
> > > > +                     +----------------------------------------+
> > > > +                     |                  FPGA                  |
> > > > ++----+   +-------+   +-----------+  +----------+  +-----------+   +----------+
> > > > +|QSFP|---|retimer|---|Line Side  |--|User logic|--|Host Side  |---|XL710     |
> > > > ++----+   +-------+   |Ether Group|  |          |  |Ether Group|   |Ethernet  |
> > > > +                     |(PHY + MAC)|  |wiring &  |  |(MAC + PHY)|   |Controller|
> > > > +                     +-----------+  |offloading|  +-----------+   +----------+
> > > > +                     |              +----------+              |
> > > > +                     |                                        |
> > > > +                     +----------------------------------------+

> > > 

> > > Is XL710 required? I assume any MAC with the correct MII interface

> > > will work?

> > 

> > The XL710 is required for this implementation, in which we have the Host

> > Side Ether Group facing the host.  The Host Side Ether Group actually

> > contains the same IP blocks as Line Side. It contains the compacted MAC &

> > PHY functionalities for 25G/40G case. The 25G MAC-PHY soft IP SPEC can

> > be found at:

> > 

> > https://www.intel.com/content/www/us/en/programmable/documentation/ewo1447742896786.html

> > 

> > So raw serial data is output from Host Side FPGA, and XL710 is good to

> > handle this.

> 

> What i have seen working with Marvell Ethernet switches, is that

> Marvell normally recommends connecting them to the Ethernet interfaces

> of Marvell SoCs. But the switch just needs a compatible MII interface,

> and lots of boards make use of non-Marvell MAC chips. Freescale FEC is

> very popular.

> 

> What i'm trying to say is that ideally we need a collection of generic

> drivers for the different major components on the board, and a board

> driver which glues it all together. That then allows somebody to build

> other boards, or integrate the FPGA directly into an embedded system

> directly connected to a SoC, etc.

> 

> > > Do you really mean PHY? I actually expect it is PCS? 

> > 

> > For this implementation, yes.

> 

> Yes, you have a PHY? Or Yes, it is PCS?


Sorry, I mean I have a PHY.

> 

> To me, the phylib maintainer, having a PHY means you have a base-T

> interface, 25Gbase-T, 40Gbase-T?  That would be an odd and expensive

> architecture when you should be able to just connect SERDES interfaces

> together.


I see your concerns about the SERDES interface between the FPGA & XL710.

Considering DSA, we just enable the CPU-facing ports, so it seems the
SERDES interface connection doesn't impact the software. It's just too
expensive.

> 

> > > > +The DFL Ether Group driver registers netdev for each line side link. Users

> > > > +could use standard commands (ethtool, ip, ifconfig) for configuration and

> > > > +link state/statistics reading. For host side links, they are always connected

> > > > +to the host ethernet controller, so they should always have same features as

> > > > +the host ethernet controller. There is no need to register netdevs for them.

> > > 

> > > So lets say the XL710 is eth0. The line side netif is eth1. Where do i

> > > put the IP address? What interface do i add to quagga OSPF? 

> > 

> > The IP address should be put in eth0. eth0 should always be used for the

> > tools.

> 

> That was what i was afraid of :-)

> 

> > 

> > The line/host side Ether Group is not the terminal of the network data stream.

> > Eth1 will not paticipate in the network data exchange to host.

> > 

> > The main purposes for eth1 are:

> > 1. For users to monitor the network statistics on Line Side, and by comparing the

> > statistics between eth0 & eth1, users could get some knowledge of how the User

> > logic is taking function.

> > 

> > 2. Get the link state of the front panel. The XL710 is now connected to

> > Host Side of the FPGA and the its link state would be always on. So to

> > check the link state of the front panel, we need to query eth1.

> 

> This is very non-intuitive. We try to avoid this in the kernel and the

> API to userspace. Ethernet switches are always modelled as

> accelerators for what the Linux network stack can already do. You

> configure an Ethernet switch port in just the same way configure any

> other netdev. You add an IP address to the switch port, you get the

> Ethernet statistics from the switch port, routing protocols use the

> switch port.

> 

> You design needs to be the same. All configuration needs to happen via

> eth1.

> 

> Please look at the DSA architecture. What you have here is very

> similar to a two port DSA switch. In DSA terminology, we would call

> eth0 the master interface.  It needs to be up, but otherwise the user

> does not configure it. eth1 is the slave interface. It is the user

> facing interface of the switch. All configuration happens on this

> interface. Linux can also send/receive packets on this netdev. The

> slave TX function forwards the frame to the master interface netdev,

> via a DSA tagger. Frames which eth0 receive are passed through the

> tagger and then passed to the slave interface.

> 

> All the infrastructure you need is already in place. Please use

> it. I'm not saying you need to write a DSA driver, but you should make

> use of the same ideas and low level hooks in the network stack which

> DSA uses.


I did some investigation into DSA, and actually I wrote an
experimental DSA driver. It works and almost meets my needs: I can do
configuration and run pktgen on the slave interface.

A main concern with DSA is that the wiring from the slave interface to the
master interface depends on the user logic. If FPGA users want to make their
own user logic, they may need a new driver. But our original design for the
FPGA is that kernel drivers support the fundamental parts - the FPGA FIU
(where the Ether Group is) & other peripherals on the board - and userspace
does direct I/O access for the user logic. Then FPGA users don't have to
write & compile a driver when their user logic changes.
That does not seem to be the case for a netdev. The user logic is a part of
the whole functionality of the netdev, so we cannot split part of the
hardware components off to userspace and keep the rest in the kernel. I
really need to reconsider this.

> 

> > > What about the QSPF socket? Can the host get access to the I2C bus?

> > > The pins for TX enable, etc. ethtool -m?

> > 

> > No, the QSPF/I2C are also managed by the BMC firmware, and host doesn't

> > have interface to talk to BMC firmware about QSPF.

> 

> So can i even tell what SFP is in the socket? 


No.

> 

> > > > +Speed/Duplex

> > > > +------------

> > > > +The Ether Group doesn't support auto-negotiation. The link speed is fixed to

> > > > +10G, 25G or 40G full duplex according to which Ether Group IP is programmed.

> > > 

> > > So that means, if i pop out the SFP and put in a different one which

> > > supports a different speed, it is expected to be broken until the FPGA

> > > is reloaded?

> > 

> > It is expected to be broken.

> 

> And since i have no access to the SFP information, i have no idea what

> is actually broken? How i should configure the various layers?


With this hardware implementation, I'm afraid the host cannot know what is broken.
It can just see that the speed of the slave interface never changes, and that the
link state is "No" on the slave interface. Is it like the fixed PHY or fixed-link mode?

Is it possible to just treat it as fixed and configure the layers?

> 
> > Now the line side is expected to be configured to 4x10G, 4x25G, 2x25G, 1x25G.
> > host side is expected to be 4x10G or 2x40G for XL710.
> > 
> > So 4 channel SFP is expected to be inserted to front panel. And we should use
> > 4x25G SFP, which is compatible to 4x10G connection.
> 
> So if you had exported the SFP to linux, phylink could have handled some
> of this for you. Probably with some extensions to phylink, but Russell
> King would have probably helped you. phylink has a good idea how to
> decode the SFP EEPROM and figure out the link mode. It has interfaces
> to configure PCS blocks, so it could probably deal with the line side
> and host side PCS. And it would have been easy to send a udev
> notification that the SFP has changed, maybe user space needs to
> download a different FPGA bit file? So the user would not see a broken
> interface, the hardware could be reconfigured on the fly.
> 
> This is one problem i have with this driver. It is based around this
> somewhat broken reference design. phylib, along with the hacks you
> have, are enough for this reference design. But really you want to
> make use of phylink in order to support less limited designs which
> will follow. Or you need to push a lot more into the BMC, and don't
> use phylib at all.


Mm.. it seems the hardware should be changed, either to let the host directly
access the QSFP, or to re-design the BMC to provide more info about the QSFP.

Would it be possible to leave the hardware unchanged and support the
components (QSFP, retimer) in fixed-link mode? I know this makes the
driver specific to the board, but the boards are being used by
customers and I'm trying to make them supported without hardware
changes...


Thanks for your very detailed explanation and guidance.
Yilun

> 
>     Andrew
Jakub Kicinski Oct. 26, 2020, 6:35 p.m. UTC | #6
On Tue, 27 Oct 2020 01:38:04 +0800 Xu Yilun wrote:
> > > The line/host side Ether Group is not the terminal of the network data stream.
> > > Eth1 will not participate in the network data exchange to host.
> > > 
> > > The main purposes for eth1 are:
> > > 1. For users to monitor the network statistics on Line Side, and by comparing the
> > > statistics between eth0 & eth1, users could get some knowledge of how the User
> > > logic is taking function.
> > > 
> > > 2. Get the link state of the front panel. The XL710 is now connected to
> > > Host Side of the FPGA and its link state would always be on. So to
> > > check the link state of the front panel, we need to query eth1.  
> > 
> > This is very non-intuitive. We try to avoid this in the kernel and the
> > API to userspace. Ethernet switches are always modelled as
> > accelerators for what the Linux network stack can already do. You
> > configure an Ethernet switch port in just the same way you configure any
> > other netdev. You add an IP address to the switch port, you get the
> > Ethernet statistics from the switch port, routing protocols use the
> > switch port.
> > 
> > Your design needs to be the same. All configuration needs to happen via
> > eth1.
> > 
> > Please look at the DSA architecture. What you have here is very
> > similar to a two port DSA switch. In DSA terminology, we would call
> > eth0 the master interface.  It needs to be up, but otherwise the user
> > does not configure it. eth1 is the slave interface. It is the user
> > facing interface of the switch. All configuration happens on this
> > interface. Linux can also send/receive packets on this netdev. The
> > slave TX function forwards the frame to the master interface netdev,
> > via a DSA tagger. Frames which eth0 receive are passed through the
> > tagger and then passed to the slave interface.
> > 
> > All the infrastructure you need is already in place. Please use
> > it. I'm not saying you need to write a DSA driver, but you should make
> > use of the same ideas and low level hooks in the network stack which
> > DSA uses.  
> 
> I did some investigation about the DSA, and actually I wrote a
> experimental DSA driver. It works and almost meets my need, I can make
> configuration, run pktgen on slave inf.
> 
> A main concern for dsa is the wiring from slave inf to master inf depends
> on the user logic. If FPGA users want to make their own user logic, they
> may need a new driver. But our original design for the FPGA is, kernel
> drivers support the fundamental parts - FPGA FIU (where Ether Group is in)
> & other peripherals on board, and userspace direct I/O access for User
> logic. Then FPGA user don't have to write & compile a driver for their
> user logic change.
> It seems not that case for netdev. The user logic is a part of the whole
> functionality of the netdev, we cannot split part of the hardware
> component to userspace and the rest in kernel. I really need to
> reconsider this.

This is obviously on purpose. Your design as it stands will not fly
upstream, sorry.

From netdev perspective the user should not care how many hardware
blocks are in the pipeline, and on which piece of silicon. You have 
a 2 port (modulo port splitting) card, there should be 2 netdevs, and
the link config and forwarding should be configured through those.

Please let folks at Intel know that we don't like the "SDK in user
space with reuse [/abuse] of parts of netdev infra" architecture.
This is a second of those we see in a short time. Kernel is not a
library for your SDK to use.
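
As an illustration of the master/slave flow quoted above, here is a toy
model in Python. It is only a sketch under stated assumptions, not kernel
code and not the driver from this series: the class names, the one-byte
tag and the dict of slaves are all hypothetical, and real DSA taggers use
their own tag layouts. It shows the two points made in the quote: user
traffic and statistics live on the slave (eth1), while the master (eth0)
only carries tagged frames.

```python
from dataclasses import dataclass, field

TAG_LEN = 1  # hypothetical: one leading byte that carries the port index


@dataclass
class MasterNetdev:
    """Toy stand-in for eth0, the interface that actually moves frames."""
    wire: list = field(default_factory=list)

    def xmit(self, frame: bytes) -> None:
        self.wire.append(frame)


@dataclass
class SlaveNetdev:
    """Toy stand-in for a user-facing port such as eth1."""
    port: int
    master: MasterNetdev
    rx_queue: list = field(default_factory=list)
    tx_packets: int = 0
    rx_packets: int = 0

    def xmit(self, payload: bytes) -> None:
        # Slave TX: the tagger prepends the port tag, then the frame is
        # handed to the master interface for transmission.
        self.master.xmit(bytes([self.port]) + payload)
        self.tx_packets += 1


def master_rx(frame: bytes, slaves: dict) -> bool:
    """Master RX: strip the tag and deliver the payload to that port's slave."""
    if len(frame) < TAG_LEN:
        return False
    slave = slaves.get(frame[0])
    if slave is None:
        return False  # no slave for this port: drop the frame
    slave.rx_queue.append(frame[TAG_LEN:])
    slave.rx_packets += 1
    return True


master = MasterNetdev()
slaves = {1: SlaveNetdev(port=1, master=master)}
slaves[1].xmit(b"ping")                # user traffic enters via the slave
master_rx(master.wire[0], slaves)      # loop the tagged frame back in
print(slaves[1].tx_packets, slaves[1].rx_packets, slaves[1].rx_queue)
```

In this model the counters a user would read belong to the slave object,
which mirrors the request that configuration and statistics be exposed on
the user-facing netdev rather than on the master.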
Andrew Lunn Oct. 26, 2020, 7:14 p.m. UTC | #7
> > > > Do you really mean PHY? I actually expect it is PCS? 
> > > 
> > > For this implementation, yes.
> > 
> > Yes, you have a PHY? Or Yes, it is PCS?
> 
> Sorry, I mean I have a PHY.
> 
> > 
> > To me, the phylib maintainer, having a PHY means you have a base-T
> > interface, 25Gbase-T, 40Gbase-T?  That would be an odd and expensive
> > architecture when you should be able to just connect SERDES interfaces
> > together.


You really have 25Gbase-T, 40Gbase-T? Between the FPGA & XL710?
What copper PHYs are you using? 

> I see your concerns about the SERDES interface between FPGA & XL710.


I have no concerns about direct SERDES connections. That is the normal
way of doing this. It keeps it a lot simpler, since you don't have to
worry about driving the PHYs.

> I did some investigation about the DSA, and actually I wrote a
> experimental DSA driver. It works and almost meets my need, I can make
> configuration, run pktgen on slave inf.


Cool. As i said, I don't know if this actually needs to be a DSA
driver. It might just need to borrow some ideas from DSA.

> Mm.. seems the hardware should be changed, either let host directly
> access the QSFP, or re-design the BMC to provide more info for QSFP.


At a minimum, you need to support ethtool -m. It could be a firmware
call to the BMC, or you expose the i2c bus somehow. There are plenty
of MAC drivers which implement ethtool -m without using phylink.

But i think you need to take a step back first, and look at the bigger
picture. What is Intel's goal? Are they just going to sell complete
cards? Or do they also want to sell the FPGA as a component anybody
can put onto their own board?

If there are only ever going to be complete cards, then you can go the
firmware direction, push a lot of functionality into the BMC, and have
the card driver make firmware calls to control the SFP, retimer,
etc. You can then throw away your mdio and phy driver hacks.

If however, the FPGA is going to be available as a component, can you
also assume there is a BMC? Running Intel firmware? Can the customer
also modify this firmware for their own needs? I think that is going
to be difficult. So you need to push as much as possible towards
linux, and let Linux drive all the hardware, the SFP, retimer, FPGA,
etc.

	Andrew
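
To make the "ethtool -m" point concrete, here is a small Python sketch of
the first step in decoding a module EEPROM dump. It is an illustration
only and makes several assumptions: the function and table names are
hypothetical, the input is assumed to be a raw dump such as the bytes
from "ethtool -m <dev> raw on" on a driver that implements it, and only
the identifier byte (byte 0, as defined by SFF-8024) is examined. The
lane counts are nominal values for the form factor, not something read
from the module.

```python
# Identifier codes for byte 0 of the module EEPROM, per SFF-8024.
# Lane counts and per-lane rates are nominal values for the form factor.
MODULE_TYPES = {
    0x03: ("SFP/SFP+/SFP28", None),
    0x0C: ("QSFP", None),
    0x0D: ("QSFP+", (4, 10)),    # 4 lanes x 10 Gb/s
    0x11: ("QSFP28", (4, 25)),   # 4 lanes x 25 Gb/s
}


def describe_module(eeprom: bytes) -> str:
    """Return a short description based on the identifier byte only."""
    if not eeprom:
        raise ValueError("empty EEPROM dump")
    ident = eeprom[0]
    name, lanes = MODULE_TYPES.get(ident, (None, None))
    if name is None:
        return f"unknown module (identifier 0x{ident:02x})"
    if lanes is None:
        return name
    count, gbps = lanes
    return f"{name}, nominally {count}x{gbps}G"


# Example with a fabricated dump: identifier byte followed by zero padding.
print(describe_module(bytes([0x11]) + bytes(127)))
```

With even this much exposed to the host, user space could tell whether
the inserted module matches the speed of the currently loaded Ether Group
IP, which is the information the host lacks in the design discussed here.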
Xu Yilun Oct. 27, 2020, 2:33 a.m. UTC | #8
On Mon, Oct 26, 2020 at 11:35:52AM -0700, Jakub Kicinski wrote:
> On Tue, 27 Oct 2020 01:38:04 +0800 Xu Yilun wrote:

> > > > The line/host side Ether Group is not the terminal of the network data stream.
> > > > Eth1 will not participate in the network data exchange to host.
> > > > 
> > > > The main purposes for eth1 are:
> > > > 1. For users to monitor the network statistics on Line Side, and by comparing the
> > > > statistics between eth0 & eth1, users could get some knowledge of how the User
> > > > logic is taking function.
> > > > 
> > > > 2. Get the link state of the front panel. The XL710 is now connected to
> > > > Host Side of the FPGA and its link state would always be on. So to
> > > > check the link state of the front panel, we need to query eth1.  
> > > 
> > > This is very non-intuitive. We try to avoid this in the kernel and the
> > > API to userspace. Ethernet switches are always modelled as
> > > accelerators for what the Linux network stack can already do. You
> > > configure an Ethernet switch port in just the same way you configure any
> > > other netdev. You add an IP address to the switch port, you get the
> > > Ethernet statistics from the switch port, routing protocols use the
> > > switch port.
> > > 
> > > Your design needs to be the same. All configuration needs to happen via
> > > eth1.
> > > 
> > > Please look at the DSA architecture. What you have here is very
> > > similar to a two port DSA switch. In DSA terminology, we would call
> > > eth0 the master interface.  It needs to be up, but otherwise the user
> > > does not configure it. eth1 is the slave interface. It is the user
> > > facing interface of the switch. All configuration happens on this
> > > interface. Linux can also send/receive packets on this netdev. The
> > > slave TX function forwards the frame to the master interface netdev,
> > > via a DSA tagger. Frames which eth0 receive are passed through the
> > > tagger and then passed to the slave interface.
> > > 
> > > All the infrastructure you need is already in place. Please use
> > > it. I'm not saying you need to write a DSA driver, but you should make
> > > use of the same ideas and low level hooks in the network stack which
> > > DSA uses.  
> > 
> > I did some investigation about the DSA, and actually I wrote a
> > experimental DSA driver. It works and almost meets my need, I can make
> > configuration, run pktgen on slave inf.
> > 
> > A main concern for dsa is the wiring from slave inf to master inf depends
> > on the user logic. If FPGA users want to make their own user logic, they
> > may need a new driver. But our original design for the FPGA is, kernel
> > drivers support the fundamental parts - FPGA FIU (where Ether Group is in)
> > & other peripherals on board, and userspace direct I/O access for User
> > logic. Then FPGA user don't have to write & compile a driver for their
> > user logic change.
> > It seems not that case for netdev. The user logic is a part of the whole
> > functionality of the netdev, we cannot split part of the hardware
> > component to userspace and the rest in kernel. I really need to
> > reconsider this.
> 
> This is obviously on purpose. Your design as it stands will not fly
> upstream, sorry.
> 
> From netdev perspective the user should not care how many hardware
> blocks are in the pipeline, and on which piece of silicon. You have 
> a 2 port (modulo port splitting) card, there should be 2 netdevs, and
> the link config and forwarding should be configured through those.
> 
> Please let folks at Intel know that we don't like the "SDK in user
> space with reuse [/abuse] of parts of netdev infra" architecture.
> This is a second of those we see in a short time. Kernel is not a
> library for your SDK to use. 


I get your point. I'll share the information internally and reconsider
the design.

Thanks,
Yilun
Xu Yilun Oct. 27, 2020, 3:27 a.m. UTC | #9
On Mon, Oct 26, 2020 at 08:14:00PM +0100, Andrew Lunn wrote:
> > > > > Do you really mean PHY? I actually expect it is PCS? 
> > > > 
> > > > For this implementation, yes.
> > > 
> > > Yes, you have a PHY? Or Yes, it is PCS?
> > 
> > Sorry, I mean I have a PHY.
> > 
> > > 
> > > To me, the phylib maintainer, having a PHY means you have a base-T
> > > interface, 25Gbase-T, 40Gbase-T?  That would be an odd and expensive
> > > architecture when you should be able to just connect SERDES interfaces
> > > together.
> 
> You really have 25Gbase-T, 40Gbase-T? Between the FPGA & XL710?
> What copper PHYs are you using? 

Sorry for the confusion. I'll check with our board designer and reply
later.

> 
> > I see your concerns about the SERDES interface between FPGA & XL710.
> 
> I have no concerns about direct SERDES connections. That is the normal
> way of doing this. It keeps it a lot simpler, since you don't have to
> worry about driving the PHYs.
> 
> > I did some investigation about the DSA, and actually I wrote a
> > experimental DSA driver. It works and almost meets my need, I can make
> > configuration, run pktgen on slave inf.
> 
> Cool. As i said, I don't know if this actually needs to be a DSA
> driver. It might just need to borrow some ideas from DSA.
> 
> > Mm.. seems the hardware should be changed, either let host directly
> > access the QSFP, or re-design the BMC to provide more info for QSFP.
> 
> At a minimum, you need to support ethtool -m. It could be a firmware
> call to the BMC, or you expose the i2c bus somehow. There are plenty
> of MAC drivers which implement ethtool -m without using phylink.
> 
> But i think you need to take a step back first, and look at the bigger
> picture. What is Intel's goal? Are they just going to sell complete
> cards? Or do they also want to sell the FPGA as a component anybody
> can put onto their own board?
> 
> If there are only ever going to be complete cards, then you can go the
> firmware direction, push a lot of functionality into the BMC, and have
> the card driver make firmware calls to control the SFP, retimer,
> etc. You can then throw away your mdio and phy driver hacks.
> 
> If however, the FPGA is going to be available as a component, can you
> also assume there is a BMC? Running Intel firmware? Can the customer
> also modify this firmware for their own needs? I think that is going
> to be difficult. So you need to push as much as possible towards
> linux, and let Linux drive all the hardware, the SFP, retimer, FPGA,
> etc.

This is very helpful. I'll share it with our team and reconsider the
design.

Thanks,
Yilun

> 
> 	Andrew
>
Xu Yilun Nov. 2, 2020, 2:38 a.m. UTC | #10
Hi Andrew:

On Mon, Oct 26, 2020 at 08:14:00PM +0100, Andrew Lunn wrote:
> > > > > Do you really mean PHY? I actually expect it is PCS? 
> > > > 
> > > > For this implementation, yes.
> > > 
> > > Yes, you have a PHY? Or Yes, it is PCS?
> > 
> > Sorry, I mean I have a PHY.
> > 
> > > 
> > > To me, the phylib maintainer, having a PHY means you have a base-T
> > > interface, 25Gbase-T, 40Gbase-T?  That would be an odd and expensive
> > > architecture when you should be able to just connect SERDES interfaces
> > > together.
> 
> You really have 25Gbase-T, 40Gbase-T? Between the FPGA & XL710?
> What copper PHYs are you using? 
> 
> > I see your concerns about the SERDES interface between FPGA & XL710.
> 
> I have no concerns about direct SERDES connections. That is the normal
> way of doing this. It keeps it a lot simpler, since you don't have to
> worry about driving the PHYs.
>

I did some investigation and now I have some details.
The term 'PHY' in the Ether Group spec actually refers to the PCS + PMA. The
figure below shows one configuration:

 +------------------------+          +-----------------+
 | Host Side Ether Group  |          |      XL710      |
 |                        |          |                 |
 | +--------------------+ |          |                 |
 | | 40G Ether IP       | |          |                 |
 | |                    | |          |                 |
 | |       +---------+  | |  XLAUI   |                 |
 | | MAC - |PCS - PMA|  | |----------| PMA - PCS - MAC |
 | |       +---------+  | |          |                 |
 +-+--------------------+-+          +-----------------+

Thanks,
Yilun
Andrew Lunn Nov. 2, 2020, 2:46 p.m. UTC | #11
> I did some investigation and now I have some details.
> The term 'PHY' described in Ether Group Spec should be the PCS + PMA, a figure
> below for one configuration:
> 
>  +------------------------+          +-----------------+
>  | Host Side Ether Group  |          |      XL710      |
>  |                        |          |                 |
>  | +--------------------+ |          |                 |
>  | | 40G Ether IP       | |          |                 |
>  | |                    | |          |                 |
>  | |       +---------+  | |  XLAUI   |                 |
>  | | MAC - |PCS - PMA|  | |----------| PMA - PCS - MAC |
>  | |       +---------+  | |          |                 |
>  +-+--------------------+-+          +-----------------+


Thanks, that makes a lot more sense.

	Andrew

Patch

diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index cbb75a18..eb7c443 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -26,6 +26,7 @@  Contents:
    freescale/gianfar
    google/gve
    huawei/hinic
+   intel/dfl-eth-group
    intel/e100
    intel/e1000
    intel/e1000e
diff --git a/Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst b/Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst
new file mode 100644
index 0000000..525807e
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/dfl-eth-group.rst
@@ -0,0 +1,102 @@ 
+.. SPDX-License-Identifier: GPL-2.0+
+
+=======================================================================
+DFL device driver for Ether Group private feature on Intel(R) PAC N3000
+=======================================================================
+
+This is the driver for Ether Group private feature on Intel(R)
+PAC (Programmable Acceleration Card) N3000.
+
+The Intel(R) PAC N3000 is an FPGA-based SmartNIC platform for multi-workload
+networking application acceleration. A simple diagram of the board:
+
+                     +----------------------------------------+
+                     |                  FPGA                  |
++----+   +-------+   +-----------+  +----------+  +-----------+   +----------+
+|QSFP|---|retimer|---|Line Side  |--|User logic|--|Host Side  |---|XL710     |
++----+   +-------+   |Ether Group|  |          |  |Ether Group|   |Ethernet  |
+                     |(PHY + MAC)|  |wiring &  |  |(MAC + PHY)|   |Controller|
+                     +-----------+  |offloading|  +-----------+   +----------+
+                     |              +----------+              |
+                     |                                        |
+                     +----------------------------------------+
+
+The FPGA is composed of FPGA Interface Module (FIM) and Accelerated Function
+Unit (AFU). The FIM implements the basic functionalities for FPGA access,
+management and reprogramming, while the AFU is the FPGA reprogrammable region
+for users.
+
+The Line Side & Host Side Ether Groups are soft IP blocks embedded in FIM. They
+are internally wire connected to AFU and communicate with AFU with MAC packets.
+The user logic is developed by the FPGA users and re-programmed to AFU,
+providing the user defined wire connections between line side & host side data
+interfaces, as well as the MAC layer offloading.
+
+There are 2 types of interfaces for the Ether Groups:
+
+1. The data interfaces connect the Ether Groups and the AFU. The host has no
+ability to control the data stream, so the FPGA is like a pipe between the
+host ethernet controller and the retimer chip.
+
+2. The management interfaces connect the Ether Groups to the host, so the host
+can access the Ether Group registers for configuration and statistics
+reading.
+
+The Intel(R) PAC N3000 could be programmed to various configurations (with
+different link numbers and speeds, e.g. 8x10G, 4x25G ...). It is done by
+programming different variants of the Ether Group IP blocks, and doing
+corresponding configuration to the retimer chips.
+
+The DFL Ether Group driver registers netdev for each line side link. Users
+could use standard commands (ethtool, ip, ifconfig) for configuration and
+link state/statistics reading. For host side links, they are always connected
+to the host ethernet controller, so they should always have same features as
+the host ethernet controller. There is no need to register netdevs for them.
+The driver just enables these links on probe.
+
+The retimer chips are managed by onboard BMC (Board Management Controller)
+firmware; the host driver cannot access them directly. So a retimer mostly
+behaves like an external fixed PHY. However, due to a hardware limitation,
+the link states detected by the retimer chips cannot be propagated to the
+Ether Groups. To manage the link state, a PHY driver (intel-m10-bmc-retimer)
+is introduced to query the BMC for the retimer's link state. The Ether Group
+driver connects to the PHY devices and gets the link states. The
+intel-m10-bmc-retimer driver creates a pseudo MDIO bus for each board, so
+that the Ether Group driver can find the PHY devices by their pseudo PHY
+addresses.
+
+
+2. Features supported
+=====================
+
+Data Path
+---------
+Since the driver can't control the data stream, the Ether Group driver
+doesn't implement functional tx/rx handlers. Any transmit attempt on these
+links from the host will be dropped, and the host cannot receive data from
+these links. Users should operate on the netdev of the host ethernet
+controller for networking data traffic.
+
+
+Speed/Duplex
+------------
+The Ether Group doesn't support auto-negotiation. The link speed is fixed to
+10G, 25G or 40G full duplex according to which Ether Group IP is programmed.
+
+Statistics
+----------
+The Ether Group IP has the statistics counters for ethernet traffic and errors.
+The user can obtain these MAC-level statistics using "ethtool -S" option.
+
+MTU
+---
+The Ether Group IP is capable of detecting oversized packets. It will not drop
+the packet but pass it up and increment the tx/rx oversize counters. The MTU
+could be changed via ip or ifconfig commands.
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+transmitting pause frames. Receiving pause request from outside to Ether Group
+MAC is not supported. The flow control auto-negotiation is not supported. The
+user can enable or disable Tx Flow Control using "ethtool -A eth? tx <on|off>"