mbox series

[rdma-next,v1,00/11] Optional counter statistics support

Message ID cover.1631660727.git.leonro@nvidia.com
Headers show
Series Optional counter statistics support | expand

Message

Leon Romanovsky Sept. 14, 2021, 11:07 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Change Log:
v1:
* Add a descriptor structure to replace name in struct rdma_hw_stats;
* Add a bitmap in struct rdma_hw_stats to indicate the enable/disable
  status of all counters;
* Add a "flag" field in counter descriptor and define
  IB_STAT_FLAG_OPTIONAL flag;
* add/remove_op_stat() are replaced by modify_op_stat();
* Use "set/unset" in command line and send full opcounters list through
  netlink, and send opcounter indexes instead of names;
* Patches are re-ordered.
v0: https://lore.kernel.org/all/20210818112428.209111-1-markzhang@nvidia.com

----------------------------------------------------------------------
Hi,

This series from Neta and Aharon provides an extension to the rdma
statistics tool that allows to set optional counters dynamically, using
netlink.

The idea of having optional counters is to provide to the users the
ability to get statistics of counters that hurts performance.

Once an optional counter was added, its statistics will be presented
along with all the counters, using the show command.

Binding objects to the optional counters is currently not supported,
neither in auto mode nor in manual mode.

To get the list of optional counters that are supported on this device,
use "rdma statistic mode supported". To see which counters are currently
enabled, use "rdma statistic mode".

Examples:

$ rdma statistic mode supported
link rocep8s0f0/1 supported optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts,cc_tx_cnp_pkts
link rocep8s0f1/1 supported optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts,cc_tx_cnp_pkts

$ sudo rdma statistic set link rocep8s0f0/1 optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts
$ rdma statistic mode link rocep8s0f0/1
link rocep8s0f0/1 optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts

$ rdma statistic show link rocep8s0f0/1
link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0
out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0
local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0
req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0
resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 roce_adp_retrans_to 0
roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 rp_cnp_ignored 0
rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 rx_icrc_encapsulated 0 cc_rx_ce_pkts 0
cc_rx_cnp_pkts 0

$ sudo rdma statistic set link rocep8s0f0/1 optional-counters cc_rx_ce_pkts
$ rdma statistic mode link rocep8s0f0/1
link rocep8s0f0/1 optional-counters cc_rx_ce_pkts

Thanks

Aharon Landau (11):
  net/mlx5: Add ifc bits to support optional counters
  net/mlx5: Add priorities for counters in RDMA namespaces
  RDMA/counter: Add a descriptor in struct rdma_hw_stats
  RDMA/counter: Add an is_disabled field in struct rdma_hw_stats
  RDMA/counter: Add optional counter support
  RDMA/nldev: Add support to get status of all counters
  RDMA/nldev: Allow optional-counter status configuration through RDMA
    netlink
  RDMA/mlx5: Support optional counters in hw_stats initialization
  RDMA/mlx5: Add steering support in optional flow counters
  RDMA/mlx5: Add modify_op_stat() support
  RDMA/mlx5: Add optional counter support in get_hw_stats callback

 drivers/infiniband/core/counters.c            |  30 +-
 drivers/infiniband/core/device.c              |   1 +
 drivers/infiniband/core/nldev.c               | 289 +++++++++++++-----
 drivers/infiniband/core/sysfs.c               |  41 ++-
 drivers/infiniband/hw/bnxt_re/hw_counters.c   | 114 +++----
 drivers/infiniband/hw/cxgb4/provider.c        |  22 +-
 drivers/infiniband/hw/efa/efa_verbs.c         |  19 +-
 drivers/infiniband/hw/hfi1/verbs.c            |  43 +--
 drivers/infiniband/hw/irdma/verbs.c           |  98 +++---
 drivers/infiniband/hw/mlx4/main.c             |  37 +--
 drivers/infiniband/hw/mlx4/mlx4_ib.h          |   2 +-
 drivers/infiniband/hw/mlx5/counters.c         | 280 ++++++++++++++---
 drivers/infiniband/hw/mlx5/fs.c               | 187 ++++++++++++
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |  28 +-
 drivers/infiniband/sw/rxe/rxe_hw_counters.c   |  42 +--
 .../net/ethernet/mellanox/mlx5/core/fs_core.c |  54 +++-
 include/linux/mlx5/device.h                   |   2 +
 include/linux/mlx5/fs.h                       |   2 +
 include/linux/mlx5/mlx5_ifc.h                 |  22 +-
 include/rdma/ib_hdrs.h                        |   1 +
 include/rdma/ib_verbs.h                       |  48 ++-
 include/rdma/rdma_counter.h                   |   2 +
 include/uapi/rdma/rdma_netlink.h              |   3 +
 23 files changed, 1041 insertions(+), 326 deletions(-)

Comments

Jason Gunthorpe Sept. 27, 2021, 4:53 p.m. UTC | #1
On Wed, Sep 15, 2021 at 02:07:23AM +0300, Leon Romanovsky wrote:

> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c

> index 3f6b98a87566..67519730b1ac 100644

> +++ b/drivers/infiniband/core/nldev.c

> @@ -968,15 +968,21 @@ static int fill_stat_counter_hwcounters(struct sk_buff *msg,

>  	if (!table_attr)

>  		return -EMSGSIZE;

> @@ -601,11 +604,20 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(

>  	if (!stats)

>  		return NULL;

>  

> +	stats->is_disabled = kcalloc(BITS_TO_LONGS(num_counters),

> +				      sizeof(long), GFP_KERNEL);

> +	if (!stats->is_disabled)

> +		goto err;

> +


Please de-inline this function and make a rdma_free_hw_stats_struct()
call to pair with it. The hw_stats_data kfree should be in there. If
you do this as a precursor patch this patch will be much smaller.

Also, the 

        stats = kzalloc(sizeof(*stats) + num_counters * sizeof(u64),
                        GFP_KERNEL);

Should be using array_size

Jason
Jason Gunthorpe Sept. 27, 2021, 5:03 p.m. UTC | #2
On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote:
>  

> +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable)

> +{

> +	struct rdma_hw_stats *stats;

> +	int ret;

> +

> +	if (!dev->ops.modify_hw_stat)

> +		return -EOPNOTSUPP;

> +

> +	stats = ib_get_hw_stats_port(dev, port);

> +	if (!stats)

> +		return -EINVAL;

> +

> +	mutex_lock(&stats->lock);

> +	ret = dev->ops.modify_hw_stat(dev, port, index, enable);

> +	if (!ret)

> +		enable ? clear_bit(index, stats->is_disabled) :

> +			set_bit(index, stats->is_disabled);


This is not a kernel coding style write out the if, use success
oriented flow

Also, shouldn't this logic protect the driver from being called on
non-optional counters?

>  	for (i = 0; i < data->stats->num_counters; i++) {

> -		attr = &data->attrs[i];

> +		if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL)

> +			continue;

> +		attr = &data->attrs[pos];

>  		sysfs_attr_init(&attr->attr.attr);

>  		attr->attr.attr.name = data->stats->descs[i].name;

>  		attr->attr.attr.mode = 0444;

>  		attr->attr.show = hw_stat_device_show;

>  		attr->show = show_hw_stats;

> -		data->group.attrs[i] = &attr->attr.attr;

> +		data->group.attrs[pos] = &attr->attr.attr;

> +		pos++;

>  	}


This isn't OK, the hw_stat_device_show() computes the stat index like
this:

	return stat_attr->show(ibdev, ibdev->hw_stats_data->stats,
			       stat_attr - ibdev->hw_stats_data->attrs, 0, buf);

Which assumes the stats are packed contiguously. This only works
because mlx5 is always putting the optional stats at the end.

>  /**

>   * struct rdma_stat_desc

>   * @name - The name of the counter

> - *

> + * @flags - Flags of the counter; For example, IB_STAT_FLAG_OPTIONAL

>   */


The previous patch shouldn't have had the extra blank line then?

  
> +int rdma_counter_modify(struct ib_device *dev, u32 port, int index,

> +			bool is_add);


index should be unsigned int

The bool is called 'is_add' here but the implementation is 'enable' ?

Jason
Jason Gunthorpe Sept. 27, 2021, 5:20 p.m. UTC | #3
On Wed, Sep 15, 2021 at 02:07:26AM +0300, Leon Romanovsky wrote:
> -		return -EINVAL;

> +		need_enable = false;

> +		disabled = test_bit(i, stats->is_disabled);

> +		nla_for_each_nested(entry_attr,

> +				    tb[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS], rem) {

> +			index = nla_get_u32(entry_attr);

> +			if (index >= stats->num_counters)

> +				return -EINVAL;

> +			if (i == index) {

> +				need_enable = true;

> +				break;

> +			}

> +		}

>  

> -	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);

> -	if (!rdma_is_port_valid(device, port)) {

> -		ret = -EINVAL;

> -		goto err;

> +		if (disabled && need_enable)

> +			ret = rdma_counter_modify(device, port, i, true);

> +		else if (!disabled && !need_enable)

> +			ret = rdma_counter_modify(device, port, i, false);


This disabled check looks racy, I would do the no-change optimization inside
rdma_counter_modify()

Also, this is a O(N^2) algorithm, why not do it in one pass with a
small memory allocation for the target state bitmap?

Jason
Mark Zhang Sept. 28, 2021, 3:28 a.m. UTC | #4
On 9/28/2021 12:53 AM, Jason Gunthorpe wrote:
> On Wed, Sep 15, 2021 at 02:07:23AM +0300, Leon Romanovsky wrote:

> 

>> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c

>> index 3f6b98a87566..67519730b1ac 100644

>> +++ b/drivers/infiniband/core/nldev.c

>> @@ -968,15 +968,21 @@ static int fill_stat_counter_hwcounters(struct sk_buff *msg,

>>   	if (!table_attr)

>>   		return -EMSGSIZE;

>> @@ -601,11 +604,20 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(

>>   	if (!stats)

>>   		return NULL;

>>   

>> +	stats->is_disabled = kcalloc(BITS_TO_LONGS(num_counters),

>> +				      sizeof(long), GFP_KERNEL);

>> +	if (!stats->is_disabled)

>> +		goto err;

>> +

> 

> Please de-inline this function and make a rdma_free_hw_stats_struct()

> call to pair with it. The hw_stats_data kfree should be in there. If

> you do this as a precursor patch this patch will be much smaller.

> 

> Also, the

> 

>          stats = kzalloc(sizeof(*stats) + num_counters * sizeof(u64),

>                          GFP_KERNEL);

> 

> Should be using array_size

Maybe use struct_size:
   stats = kzalloc(struct_size(stats, value, num_counters), GFP_KERNEL);
Mark Zhang Sept. 28, 2021, 9:03 a.m. UTC | #5
On 9/28/2021 1:03 AM, Jason Gunthorpe wrote:
> On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote:

>>   

>> +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable)

>> +{

>> +	struct rdma_hw_stats *stats;

>> +	int ret;

>> +

>> +	if (!dev->ops.modify_hw_stat)

>> +		return -EOPNOTSUPP;

>> +

>> +	stats = ib_get_hw_stats_port(dev, port);

>> +	if (!stats)

>> +		return -EINVAL;

>> +

>> +	mutex_lock(&stats->lock);

>> +	ret = dev->ops.modify_hw_stat(dev, port, index, enable);

>> +	if (!ret)

>> +		enable ? clear_bit(index, stats->is_disabled) :

>> +			set_bit(index, stats->is_disabled);

> 

> This is not a kernel coding style write out the if, use success

> oriented flow

> 

> Also, shouldn't this logic protect the driver from being called on

> non-optional counters?


We leave it to driver, driver would return failure if modify is not 
supported. Is it good?

>>   	for (i = 0; i < data->stats->num_counters; i++) {

>> -		attr = &data->attrs[i];

>> +		if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL)

>> +			continue;

>> +		attr = &data->attrs[pos];

>>   		sysfs_attr_init(&attr->attr.attr);

>>   		attr->attr.attr.name = data->stats->descs[i].name;

>>   		attr->attr.attr.mode = 0444;

>>   		attr->attr.show = hw_stat_device_show;

>>   		attr->show = show_hw_stats;

>> -		data->group.attrs[i] = &attr->attr.attr;

>> +		data->group.attrs[pos] = &attr->attr.attr;

>> +		pos++;

>>   	}

> 

> This isn't OK, the hw_stat_device_show() computes the stat index like

> this:

> 

> 	return stat_attr->show(ibdev, ibdev->hw_stats_data->stats,

> 			       stat_attr - ibdev->hw_stats_data->attrs, 0, buf);

> 

> Which assumes the stats are packed contiguously. This only works

> because mlx5 is always putting the optional stats at the end.


Yes you are right, thanks. Maybe we can add an "index" field in struct 
hw_stats_device/port_attribute, then set it in setup and use it in show.
Jason Gunthorpe Sept. 28, 2021, 11:51 a.m. UTC | #6
On Tue, Sep 28, 2021 at 05:03:24PM +0800, Mark Zhang wrote:
> On 9/28/2021 1:03 AM, Jason Gunthorpe wrote:

> > On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote:

> > > +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable)

> > > +{

> > > +	struct rdma_hw_stats *stats;

> > > +	int ret;

> > > +

> > > +	if (!dev->ops.modify_hw_stat)

> > > +		return -EOPNOTSUPP;

> > > +

> > > +	stats = ib_get_hw_stats_port(dev, port);

> > > +	if (!stats)

> > > +		return -EINVAL;

> > > +

> > > +	mutex_lock(&stats->lock);

> > > +	ret = dev->ops.modify_hw_stat(dev, port, index, enable);

> > > +	if (!ret)

> > > +		enable ? clear_bit(index, stats->is_disabled) :

> > > +			set_bit(index, stats->is_disabled);

> > 

> > This is not a kernel coding style write out the if, use success

> > oriented flow

> > 

> > Also, shouldn't this logic protect the driver from being called on

> > non-optional counters?

> 

> We leave it to driver, driver would return failure if modify is not

> supported. Is it good?


I think the core code should do it

> > >   	for (i = 0; i < data->stats->num_counters; i++) {

> > > -		attr = &data->attrs[i];

> > > +		if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL)

> > > +			continue;

> > > +		attr = &data->attrs[pos];

> > >   		sysfs_attr_init(&attr->attr.attr);

> > >   		attr->attr.attr.name = data->stats->descs[i].name;

> > >   		attr->attr.attr.mode = 0444;

> > >   		attr->attr.show = hw_stat_device_show;

> > >   		attr->show = show_hw_stats;

> > > -		data->group.attrs[i] = &attr->attr.attr;

> > > +		data->group.attrs[pos] = &attr->attr.attr;

> > > +		pos++;

> > >   	}

> > 

> > This isn't OK, the hw_stat_device_show() computes the stat index like

> > this:

> > 

> > 	return stat_attr->show(ibdev, ibdev->hw_stats_data->stats,

> > 			       stat_attr - ibdev->hw_stats_data->attrs, 0, buf);

> > 

> > Which assumes the stats are packed contiguously. This only works

> > because mlx5 is always putting the optional stats at the end.

> 

> Yes you are right, thanks. Maybe we can add an "index" field in struct

> hw_stats_device/port_attribute, then set it in setup and use it in show.


You could just add a WARN_ON check that optional stats are at the end
I suppose

Jason
Jason Gunthorpe Sept. 28, 2021, 11:51 a.m. UTC | #7
On Tue, Sep 28, 2021 at 11:28:39AM +0800, Mark Zhang wrote:
> On 9/28/2021 12:53 AM, Jason Gunthorpe wrote:

> > On Wed, Sep 15, 2021 at 02:07:23AM +0300, Leon Romanovsky wrote:

> > 

> > > diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c

> > > index 3f6b98a87566..67519730b1ac 100644

> > > +++ b/drivers/infiniband/core/nldev.c

> > > @@ -968,15 +968,21 @@ static int fill_stat_counter_hwcounters(struct sk_buff *msg,

> > >   	if (!table_attr)

> > >   		return -EMSGSIZE;

> > > @@ -601,11 +604,20 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(

> > >   	if (!stats)

> > >   		return NULL;

> > > +	stats->is_disabled = kcalloc(BITS_TO_LONGS(num_counters),

> > > +				      sizeof(long), GFP_KERNEL);

> > > +	if (!stats->is_disabled)

> > > +		goto err;

> > > +

> > 

> > Please de-inline this function and make a rdma_free_hw_stats_struct()

> > call to pair with it. The hw_stats_data kfree should be in there. If

> > you do this as a precursor patch this patch will be much smaller.

> > 

> > Also, the

> > 

> >          stats = kzalloc(sizeof(*stats) + num_counters * sizeof(u64),

> >                          GFP_KERNEL);

> > 

> > Should be using array_size

> Maybe use struct_size:

>   stats = kzalloc(struct_size(stats, value, num_counters), GFP_KERNEL);


Right

Jason
Leon Romanovsky Sept. 29, 2021, 12:14 p.m. UTC | #8
On Tue, Sep 28, 2021 at 08:51:35AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 28, 2021 at 05:03:24PM +0800, Mark Zhang wrote:

> > On 9/28/2021 1:03 AM, Jason Gunthorpe wrote:

> > > On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote:

> > > > +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable)

> > > > +{

> > > > +	struct rdma_hw_stats *stats;

> > > > +	int ret;

> > > > +

> > > > +	if (!dev->ops.modify_hw_stat)

> > > > +		return -EOPNOTSUPP;

> > > > +

> > > > +	stats = ib_get_hw_stats_port(dev, port);

> > > > +	if (!stats)

> > > > +		return -EINVAL;

> > > > +

> > > > +	mutex_lock(&stats->lock);

> > > > +	ret = dev->ops.modify_hw_stat(dev, port, index, enable);

> > > > +	if (!ret)

> > > > +		enable ? clear_bit(index, stats->is_disabled) :

> > > > +			set_bit(index, stats->is_disabled);

> > > 

> > > This is not a kernel coding style write out the if, use success

> > > oriented flow

> > > 

> > > Also, shouldn't this logic protect the driver from being called on

> > > non-optional counters?

> > 

> > We leave it to driver, driver would return failure if modify is not

> > supported. Is it good?

> 

> I think the core code should do it

> 

> > > >   	for (i = 0; i < data->stats->num_counters; i++) {

> > > > -		attr = &data->attrs[i];

> > > > +		if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL)

> > > > +			continue;

> > > > +		attr = &data->attrs[pos];

> > > >   		sysfs_attr_init(&attr->attr.attr);

> > > >   		attr->attr.attr.name = data->stats->descs[i].name;

> > > >   		attr->attr.attr.mode = 0444;

> > > >   		attr->attr.show = hw_stat_device_show;

> > > >   		attr->show = show_hw_stats;

> > > > -		data->group.attrs[i] = &attr->attr.attr;

> > > > +		data->group.attrs[pos] = &attr->attr.attr;

> > > > +		pos++;

> > > >   	}

> > > 

> > > This isn't OK, the hw_stat_device_show() computes the stat index like

> > > this:

> > > 

> > > 	return stat_attr->show(ibdev, ibdev->hw_stats_data->stats,

> > > 			       stat_attr - ibdev->hw_stats_data->attrs, 0, buf);

> > > 

> > > Which assumes the stats are packed contiguously. This only works

> > > because mlx5 is always putting the optional stats at the end.

> > 

> > Yes you are right, thanks. Maybe we can add an "index" field in struct

> > hw_stats_device/port_attribute, then set it in setup and use it in show.

> 

> You could just add a WARN_ON check that optional stats are at the end

> I suppose


Everyone adds their counters to the end, last example is bnxt_re
9a381f7e5aa2 ("RDMA/bnxt_re: Add extended statistics counters")

Thanks

> 

> Jason
Leon Romanovsky Sept. 29, 2021, 12:27 p.m. UTC | #9
On Mon, Sep 27, 2021 at 02:20:06PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 15, 2021 at 02:07:26AM +0300, Leon Romanovsky wrote:

> > -		return -EINVAL;

> > +		need_enable = false;

> > +		disabled = test_bit(i, stats->is_disabled);

> > +		nla_for_each_nested(entry_attr,

> > +				    tb[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS], rem) {

> > +			index = nla_get_u32(entry_attr);

> > +			if (index >= stats->num_counters)

> > +				return -EINVAL;

> > +			if (i == index) {

> > +				need_enable = true;

> > +				break;

> > +			}

> > +		}

> >  

> > -	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);

> > -	if (!rdma_is_port_valid(device, port)) {

> > -		ret = -EINVAL;

> > -		goto err;

> > +		if (disabled && need_enable)

> > +			ret = rdma_counter_modify(device, port, i, true);

> > +		else if (!disabled && !need_enable)

> > +			ret = rdma_counter_modify(device, port, i, false);

> 

> This disabled check looks racy, I would do the no-change optimization inside

> rdma_counter_modify()

> 

> Also, this is a O(N^2) algorithm, why not do it in one pass with a

> small memory allocation for the target state bitmap?


We don't have many counters. Is this optimization really worth it?

Thanks

> 

> Jason