
[v2,00/16] block atomic writes

Message ID 20231212110844.19698-1-john.g.garry@oracle.com

Message

John Garry Dec. 12, 2023, 11:08 a.m. UTC
This series introduces a proposal for implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".
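
As a rough sketch of the intended usage (RWF_ATOMIC is the flag proposed
by this series and is not yet in released uapi headers, so the fallback
definition below is a placeholder; the device path and the 64K write
length are likewise only examples and must respect the limits described
below):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* placeholder value for illustration */
#endif

int main(void)
{
	struct iovec iov;
	void *buf;
	int fd;

	/* Only direct IO on block devices is supported in this series */
	fd = open("/dev/sda", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* Length and alignment must respect the statx atomic write limits */
	if (posix_memalign(&buf, 4096, 65536))
		return 1;
	memset(buf, 0, 65536);

	iov.iov_base = buf;
	iov.iov_len = 65536;

	/* Rejected (not silently ignored) if atomic writes are unsupported */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
		return 1;

	return 0;
}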

Only direct IO on block devices is supported here. For this, atomic
write HW support is required, like SCSI WRITE ATOMIC (16). For now, XFS
support from earlier versions is sidelined, until an interface is agreed
which does not fully rely on HW offload support.

A man pages update has been posted at:
https://lore.kernel.org/linux-api/20230929093717.2972367-1-john.g.garry@oracle.com/T/#t
(not updated since I posted the v1 kernel series)

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power fail, for each individual application
block, all or none of the data will be written. A racing atomic write and
read will mean that the read sees all the old data or all the new data,
but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each individual atomic write, the total length
of the write must be between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a naturally aligned offset in the file with respect to the write length.
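
For illustration, a check of that rule might look like this (a sketch
only; unit_min/unit_max stand for the values reported in the new statx
fields, and the helper name is made up):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the validity rule: the total length must be a power-of-2
 * between atomic_write_unit_min and atomic_write_unit_max (inclusive),
 * and the file offset must be naturally aligned to that length.
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	if (len < unit_min || len > unit_max)
		return false;
	if (len & (len - 1))			/* not a power-of-2 */
		return false;
	return (off & (len - 1)) == 0;		/* natural alignment */
}

So, for example, with unit_min = 4K and unit_max = 64K, a 16K write is
only permitted at file offsets that are a multiple of 16K.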

SCSI sd.c and scsi_debug and NVMe kernel support is added.

Some open questions:
- How to make the API extensible for when we have no HW support? In that
  case, we would probably not have to follow the power-of-2 length rule
  et al. As a possible solution, maybe we can say that atomic writes are
  supported for the file via statx, but not set the unit_min and max
  values, and this means that writes just need to be FS block aligned
  there.
- For block layer, should atomic_write_unit_max be limited by
  max_sectors_kb? Currently it is not.
- How to improve the requirement that iovecs are PAGE-aligned.
  There are two issues:
  a. We impose this rule to avoid splitting BIOs due to the virt boundary
     for NVMe, but the virt boundary there is 4K (and not PAGE size, so
     broken for 16K/64K pages). An easy solution is to impose a
     requirement that iovecs are 4K-aligned.
  b. We don't enforce this rule when the virt boundary == 0, i.e. SCSI
- Since debugging torn-writes due to unwanted kernel BIO splitting/merging
  would be horrible, should we add some kernel storage stack software
  integrity checks?

This series is based on v6.7-rc5.

Changes since v1:
- Drop XFS support for now
- Tidy NVMe changes and also add checks for an atomic write violating the
  max atomic write power fail (AWUPF) length and boundary (if any)
- Reject - instead of ignoring - RWF_ATOMIC for files which do not
  support atomic writes
- Update block sysfs documentation
- Various tidy-ups

Alan Adamson (2):
  nvme: Support atomic writes
  nvme: Ensure atomic writes will be executed atomically

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (10):
  block: Limit atomic writes according to bio and queue limits
  fs: Increase fmode_t size
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Limit atomic write IO size according to
    atomic_write_max_sectors
  block: Error an attempt to split an atomic write bio
  block: Add checks to merging of atomic writes
  block: Add fops atomic write support
  scsi: sd: Support reading atomic write properties from block limits
    VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  47 +++
 block/bdev.c                         |  31 +-
 block/blk-merge.c                    |  95 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 |  84 ++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  40 +-
 drivers/dma-buf/dma-buf.c            |   2 +-
 drivers/nvme/host/core.c             | 108 ++++-
 drivers/nvme/host/nvme.h             |   2 +
 drivers/scsi/scsi_debug.c            | 590 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/stat.c                            |  44 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  41 +-
 include/linux/fs.h                   |  11 +
 include/linux/stat.h                 |   2 +
 include/linux/types.h                |   2 +-
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   7 +-
 25 files changed, 1098 insertions(+), 184 deletions(-)

Comments

John Garry Dec. 13, 2023, 9:32 a.m. UTC | #1
On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be a between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>>    we would prob not have to follow rule of power-of-2 length et al.
>>    As a possible solution, maybe we can say that atomic writes are
>>    supported for the file via statx, but not set unit_min and max values,
>>    and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we every want to lift it we can still do that easily
> by adding a new flag or limit.

ok, but it would be nice to have some idea on what that flag or limit 
change would be.

> 
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement.  There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
> 
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works.  You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for 
misaligned extents or writes which spanned multiple extents.

> 
>> - For block layer, should atomic_write_unit_max be limited by
>>    max_sectors_kb? Currently it is not.
> Well.  It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max 
mapping size.

> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors. 
But people can still tweak max_sectors, and I am inclined to say that 
atomic_write_unit_max et al should be (dynamically) limited to 
max_sectors also.

> 
>> - How to improve requirement that iovecs are PAGE-aligned.
>>    There are 2x issues:
>>    a. We impose this rule to not split BIOs due to virt boundary for
>>       NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>>       16K/64K pages). Easy solution is to impose requirement that iovecs
>>       are 4K-aligned.
>>    b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit.  For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for 
transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should - 
enforce iovec PAGE-alignment or length; rather, the user could just be 
advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
atomic_write_unit_max.

> 
> 
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>>    would be horrible, should we add some kernel storage stack software
>>    integrity checks?
> Yes, I think we'll need asserts in the drivers.  At least for NVMe I
> will insist on them. 

Please see 16/16 in this series.

> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it 
does not respect the device atomic write limits.

Thanks,
John
Jan Kara Dec. 13, 2023, 11:20 a.m. UTC | #2
On Tue 12-12-23 11:08:32, John Garry wrote:
> Currently all bits are being used in fmode_t.
> 
> To allow for further expansion, increase from unsigned int to unsigned
> long.
> 
> Since the dma-buf driver prints the file->f_mode member, change the print
> as necessary to deal with the larger size.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Uh, Al has more experience with fmode_t changes so I'd defer final decision
to him but to me this seems dangerous. Firstly, this breaks packing of
struct file on 64-bit architectures and struct file is highly optimized for
cache efficiency (see the comment before the struct definition). Secondly
this will probably generate warnings on 32-bit architectures as there
sizeof(unsigned long) == sizeof(unsigned int) and so your new flags won't
fit anyway?

								Honza

> ---
>  drivers/dma-buf/dma-buf.c | 2 +-
>  include/linux/types.h     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 21916bba77d5..a5227ae3d637 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  
>  
>  		spin_lock(&buf_obj->name_lock);
> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
> +		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
>  				buf_obj->size,
>  				buf_obj->file->f_flags, buf_obj->file->f_mode,
>  				file_count(buf_obj->file),
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 253168bb3fe1..49c754fde1d6 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
>  
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
> -typedef unsigned int __bitwise fmode_t;
> +typedef unsigned long __bitwise fmode_t;
>  
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> -- 
> 2.35.3
>
Christian Brauner Dec. 13, 2023, 1:02 p.m. UTC | #3
On Tue, Dec 12, 2023 at 11:08:32AM +0000, John Garry wrote:
> Currently all bits are being used in fmode_t.
> 
> To allow for further expansion, increase from unsigned int to unsigned
> long.
> 
> Since the dma-buf driver prints the file->f_mode member, change the print
> as necessary to deal with the larger size.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  drivers/dma-buf/dma-buf.c | 2 +-
>  include/linux/types.h     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 21916bba77d5..a5227ae3d637 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  
>  
>  		spin_lock(&buf_obj->name_lock);
> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
> +		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
>  				buf_obj->size,
>  				buf_obj->file->f_flags, buf_obj->file->f_mode,
>  				file_count(buf_obj->file),
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 253168bb3fe1..49c754fde1d6 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
>  
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
> -typedef unsigned int __bitwise fmode_t;
> +typedef unsigned long __bitwise fmode_t;

As Jan said, that's likely a bad idea. There's a bunch of places that
assume fmode_t is 32bit. So not really a change we want to make if we
can avoid it.
John Garry Dec. 13, 2023, 1:03 p.m. UTC | #4
On 13/12/2023 11:20, Jan Kara wrote:
>> To allow for further expansion, increase from unsigned int to unsigned
>> long.
>>
>> Since the dma-buf driver prints the file->f_mode member, change the print
>> as necessary to deal with the larger size.
>>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
> Uh, Al has more experience with fmode_t changes so I'd defer final decision
> to him but to me this seems dangerous.

Ack

> Firstly, this breaks packing of
> struct file on 64-bit architectures and struct file is highly optimized for
> cache efficiency (see the comment before the struct definition).

From pahole, I think that we still fit on the same 64B cacheline
(x86_64), but some padding has been added.

Before:
struct file {
	union {
		struct llist_node    f_llist;                            /*     0     8 */
		struct callback_head f_rcuhead __attribute__((__aligned__(8))); /*     0    16 */
		unsigned int         f_iocb_flags;                       /*     0     4 */
	} __attribute__((__aligned__(8)));                               /*     0    16 */
	spinlock_t                 f_lock;                               /*    16     4 */
	fmode_t                    f_mode;                               /*    20     4 */
	atomic_long_t              f_count;                              /*    24     8 */
	struct mutex               f_pos_lock;                           /*    32    32 */
	/* --- cacheline 1 boundary (64 bytes) --- */

After:

struct file {
	union {
		struct llist_node    f_llist;                            /*     0     8 */
		struct callback_head f_rcuhead __attribute__((__aligned__(8))); /*     0    16 */
		unsigned int         f_iocb_flags;                       /*     0     4 */
	} __attribute__((__aligned__(8)));                               /*     0    16 */
	spinlock_t                 f_lock;                               /*    16     4 */

	/* XXX 4 bytes hole, try to pack */

	fmode_t                    f_mode;                               /*    24     8 */
	atomic_long_t              f_count;                              /*    32     8 */
	struct mutex               f_pos_lock;                           /*    40    32 */
	/* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */

> Secondly
> this will probably generate warnings on 32-bit architectures as there
> sizeof(unsigned long) == sizeof(unsigned int) and so your new flags won't
> fit anyway?

Right, it would then need to be unsigned long long. Or add another 32b 
member for extended modes. There were no i386 build warnings.

Thanks,
John
John Garry Dec. 13, 2023, 1:15 p.m. UTC | #5
On 13/12/2023 13:02, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 11:08:32AM +0000, John Garry wrote:
>> Currently all bits are being used in fmode_t.
>>
>> To allow for further expansion, increase from unsigned int to unsigned
>> long.
>>
>> Since the dma-buf driver prints the file->f_mode member, change the print
>> as necessary to deal with the larger size.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   drivers/dma-buf/dma-buf.c | 2 +-
>>   include/linux/types.h     | 2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index 21916bba77d5..a5227ae3d637 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>   
>>   
>>   		spin_lock(&buf_obj->name_lock);
>> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
>> +		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
>>   				buf_obj->size,
>>   				buf_obj->file->f_flags, buf_obj->file->f_mode,
>>   				file_count(buf_obj->file),
>> diff --git a/include/linux/types.h b/include/linux/types.h
>> index 253168bb3fe1..49c754fde1d6 100644
>> --- a/include/linux/types.h
>> +++ b/include/linux/types.h
>> @@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
>>   
>>   typedef unsigned int __bitwise gfp_t;
>>   typedef unsigned int __bitwise slab_flags_t;
>> -typedef unsigned int __bitwise fmode_t;
>> +typedef unsigned long __bitwise fmode_t;
> 
> As Jan said, that's likely a bad idea. There's a bunch of places that
> assume fmode_t is 32bit. So not really a change we want to make if we
> can avoid it.

ok, understood.

Some strictly unnecessary bits in f_mode could be recycled (if there
were any), but this issue will probably come up again.

Could we consider adding an extended fmode_t member to struct file?

Thanks,
John
Al Viro Dec. 13, 2023, 1:31 p.m. UTC | #6
On Tue, Dec 12, 2023 at 11:08:33AM +0000, John Garry wrote:

> Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
> flag set will have RWF_ATOMIC rejected and not just ignored.
> 
> Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  include/linux/fs.h      | 8 ++++++++
>  include/uapi/linux/fs.h | 5 ++++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 70329c81be31..d725c194243c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -185,6 +185,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File supports async nowait buffered writes */
>  #define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
>  
> +/* File supports atomic writes */
> +#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)0x100000000)

Have you even tried to compile that on e.g. arm?
Christoph Hellwig Dec. 13, 2023, 3:44 p.m. UTC | #7
On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>> - How to make API extensible for when we have no HW support? In that case,
>>>    we would prob not have to follow rule of power-of-2 length et al.
>>>    As a possible solution, maybe we can say that atomic writes are
>>>    supported for the file via statx, but not set unit_min and max values,
>>>    and this means that writes need to be just FS block aligned there.
>> I don't think the power of two length is much of a problem to be
>> honest, and if we every want to lift it we can still do that easily
>> by adding a new flag or limit.
>
> ok, but it would be nice to have some idea on what that flag or limit 
> change would be.

That would require a concrete use case.  The simplest thing for a file
system that can or does log I/O would simply be a flag waiving all
the alignment and size requirements.

>> I suspect we need an on-disk flag that forces allocations to be
>> aligned to the atomic write limit, in some ways similar how the
>> XFS rt flag works.  You'd need to set it on an empty file, and all
>> allocations after that are guaranteed to be properly aligned.
>
> Hmmm... so how is this different to the XFS forcealign feature?

Maybe not much.  But that's not what it is about - we need a common
API for this and not some XFS internal flag.  So if this is something
we could support in ext4 as well that would be a good step.  And for
btrfs you'd probably want to support something like it in nocow mode
if people care enough, or always support atomics and write out of
place.

> For XFS, I thought that your idea was to always CoW new extents for 
> misaligned extents or writes which spanned multiple extents.

Well, that is useful for a few things:

 - atomic writes on hardware that does not support it
 - atomic writes for bufferd I/O
 - supporting other sizes / alignments than the strict power of
   two above.

> Right, so we should limit atomic write queue limits to max_hw_sectors. But 
> people can still tweak max_sectors, and I am inclined to say that 
> atomic_write_unit_max et al should be (dynamically) limited to max_sectors 
> also.

Allowing people to tweak it seems to be asking for trouble.

>> have that silly limit.  For NVMe that would require SGL support
>> (and some driver changes I've been wanting to make for long where
>> we always use SGLs for transfers larger than a single PRP if supported)
>
> If we could avoid dealing with a virt boundary, then that would be nice.
>
> Are there any patches yet for the change to always use SGLs for transfers 
> larger than a single PRP?

No.

> On a related topic, I am not sure about how - or if we even should - 
> enforce iovec PAGE-alignment or length; rather, the user could just be 
> advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
> atomic_write_unit_max.

Anything that just advises the user and is not clear cut and results in
an error is data loss waiting to happen.  Even more so if it differs
from device to device.
John Garry Dec. 13, 2023, 4:02 p.m. UTC | #8
On 13/12/2023 13:31, Al Viro wrote:
>> Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
>> flag set will have RWF_ATOMIC rejected and not just ignored.
>>
>> Signed-off-by: Prasad Singamsetty<prasad.singamsetty@oracle.com>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> ---
>>   include/linux/fs.h      | 8 ++++++++
>>   include/uapi/linux/fs.h | 5 ++++-
>>   2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 70329c81be31..d725c194243c 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -185,6 +185,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>>   /* File supports async nowait buffered writes */
>>   #define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
>>   
>> +/* File supports atomic writes */
>> +#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)0x100000000)
> Have you even tried to compile that on e.g. arm?

i386 and now arm32, and no grumblings.

I think that the issue is that we only ever do a bitwise OR or test that 
bit 33 for a 32b value, and it is a void operation and ignored.

However if I have file.f_mode = FMODE_CAN_ATOMIC_WRITE and compile for 
arm32, then it complains.
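
The danger can be seen in a small standalone sketch (not kernel code; it
just models fmode_t as 32 bits wide, as on arm32/i386):

/* Standalone illustration only */
typedef unsigned int fmode32_t;

#define FMODE32_CAN_ATOMIC_WRITE	((fmode32_t)0x100000000)	/* bit 32 */

int can_atomic_write(fmode32_t f_mode)
{
	/*
	 * The cast silently truncates the constant to 0 in the 32-bit
	 * type, so this builds but the test can never succeed.
	 */
	return (f_mode & FMODE32_CAN_ATOMIC_WRITE) != 0;
}
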
Christoph Hellwig Dec. 13, 2023, 4:03 p.m. UTC | #9
On Wed, Dec 13, 2023 at 02:02:31PM +0100, Christian Brauner wrote:
> >  typedef unsigned int __bitwise gfp_t;
> >  typedef unsigned int __bitwise slab_flags_t;
> > -typedef unsigned int __bitwise fmode_t;
> > +typedef unsigned long __bitwise fmode_t;
> 
> As Jan said, that's likely a bad idea. There's a bunch of places that
> assume fmode_t is 32bit. So not really a change we want to make if we
> can avoid it.

Oh well, let me dust off my series to move the fairly static flags out
of it.  But even without that do we even need to increase it?  There's
still quite a lot of space after FMODE_EXEC for example.
John Garry Dec. 13, 2023, 4:27 p.m. UTC | #10
On 13/12/2023 15:44, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>>> - How to make API extensible for when we have no HW support? In that case,
>>>>     we would prob not have to follow rule of power-of-2 length et al.
>>>>     As a possible solution, maybe we can say that atomic writes are
>>>>     supported for the file via statx, but not set unit_min and max values,
>>>>     and this means that writes need to be just FS block aligned there.
>>> I don't think the power of two length is much of a problem to be
>>> honest, and if we every want to lift it we can still do that easily
>>> by adding a new flag or limit.
>> ok, but it would be nice to have some idea on what that flag or limit
>> change would be.
> That would require a concrete use case.  The simples thing for a file
> system that can or does log I/O it would simply be a flag waving all
> the alignment and size requirements.

ok...

> 
>>> I suspect we need an on-disk flag that forces allocations to be
>>> aligned to the atomic write limit, in some ways similar how the
>>> XFS rt flag works.  You'd need to set it on an empty file, and all
>>> allocations after that are guaranteed to be properly aligned.
>> Hmmm... so how is this different to the XFS forcealign feature?
> Maybe not much.  But that's not what it is about - we need a common
> API for this and not some XFS internal flag.  So if this is something
> we could support in ext4 as well that would be a good step.  And for
> btrfs you'd probably want to support something like it in nocow mode
> if people care enough, or always support atomics and write out of
> place.

The forcealign feature was a bit of a case of knowing specifically what to
do for XFS to enable atomic writes.

But some common method to configure any FS file for atomic writes (if 
supported) would be preferable, right?

> 
>> For XFS, I thought that your idea was to always CoW new extents for
>> misaligned extents or writes which spanned multiple extents.
> Well, that is useful for two things:
> 
>   - atomic writes on hardware that does not support it
>   - atomic writes for bufferd I/O
>   - supporting other sizes / alignments than the strict power of
>     two above.

Sure

> 
>> Right, so we should limit atomic write queue limits to max_hw_sectors. But
>> people can still tweak max_sectors, and I am inclined to say that
>> atomic_write_unit_max et al should be (dynamically) limited to max_sectors
>> also.
> Allowing people to tweak it seems to be asking for trouble.

I tend to agree....

Let's just limit at max_hw_sectors.

> 
>>> have that silly limit.  For NVMe that would require SGL support
>>> (and some driver changes I've been wanting to make for long where
>>> we always use SGLs for transfers larger than a single PRP if supported)
>> If we could avoid dealing with a virt boundary, then that would be nice.
>>
>> Are there any patches yet for the change to always use SGLs for transfers
>> larger than a single PRP?
> No.

As below, if we are going to enforce alignment/size of iovecs elsewhere, 
then I don't need to worry about virt boundary.

> 
>> On a related topic, I am not sure about how - or if we even should -
>> enforce iovec PAGE-alignment or length; rather, the user could just be
>> advised that iovecs must be PAGE-aligned and min PAGE length to achieve
>> atomic_write_unit_max.
> Anything that just advices the user an it not clear cut and results in
> an error is data loss waiting to happen.  Even more so if it differs
> from device to device.

ok, fine. I suppose that we better enforce it then.

Thanks,
John
John Garry Dec. 14, 2023, 8:56 a.m. UTC | #11
> But even without that do we even need to increase it?  There's
> still quite a lot of space after FMODE_EXEC for example.

Right, I can use the space after FMODE_EXEC, which came free after 
removal of FMODE_NDELAY, FMODE_EXCL, and FMODE_WRITE_IOCTL in v6.5.

Thanks,
John
Christoph Hellwig Dec. 14, 2023, 2:37 p.m. UTC | #12
On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>> Are there any patches yet for the change to always use SGLs for transfers
>>> larger than a single PRP?
>> No.

Here is the WIP version.  With that you'd need to make atomic writes
conditional on !ctrl->need_virt_boundary.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8ebdfd623e0f78..e04faffd6551fe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
 		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
 	}
-	blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
+	if (q == ctrl->admin_q || ctrl->need_virt_boundary)
+		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
 	blk_queue_dma_alignment(q, 3);
 	blk_queue_write_cache(q, vwc, vwc);
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7411dac00f725..aa98794a3ec53d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
 struct nvme_ctrl {
 	bool comp_seen;
 	bool identified;
+	bool need_virt_boundary;
 	enum nvme_ctrl_state state;
 	spinlock_t lock;
 	struct mutex scan_lock;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 61af7ff1a9d6ba..a8d273b475cb40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
 static unsigned int sgl_threshold = SZ_32K;
 module_param(sgl_threshold, uint, 0644);
 MODULE_PARM_DESC(sgl_threshold,
-		"Use SGLs when average request segment size is larger or equal to "
-		"this size. Use 0 to disable SGLs.");
+		"Use SGLs when > 0. Use 0 to disable SGLs.");
 
 #define NVME_PCI_MIN_QUEUE_SIZE 2
 #define NVME_PCI_MAX_QUEUE_SIZE 4095
@@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
-				     int nseg)
-{
-	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	unsigned int avg_seg_size;
-
-	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
-
-	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
-		return false;
-	if (!nvmeq->qid)
-		return false;
-	if (!sgl_threshold || avg_seg_size < sgl_threshold)
-		return false;
-	return true;
-}
-
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
@@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
+			nvmeq->qid && sgl_threshold;
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
@@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
 
-			if (nvmeq->qid && sgl_threshold &&
-			    nvme_ctrl_sgl_supported(&dev->ctrl))
+			if (sgl_supported)
 				return nvme_setup_sgl_simple(dev, req,
 							     &cmnd->rw, &bv);
 		}
@@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out_free_sg;
 	}
 
-	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+	if (sgl_supported)
 		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
 	else
 		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
@@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	result = nvme_init_ctrl_finish(&dev->ctrl, false);
 	if (result)
 		goto out_disable;
+	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
+		dev->ctrl.need_virt_boundary = true;
 
 	nvme_dbbuf_dma_alloc(dev);
 
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 81e2621169e5d3..416a9fbcccfc74 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
 	if (error)
 		goto out_quiesce_queue;
+	ctrl->ctrl.need_virt_boundary = true;
 
 	return 0;
John Garry Dec. 14, 2023, 3:46 p.m. UTC | #13
On 14/12/2023 14:37, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>>> Are there any patches yet for the change to always use SGLs for transfers
>>>> larger than a single PRP?
>>> No.
> Here is the WIP version.  With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

Cheers, I gave it a quick spin and on the surface looks ok.

# more /sys/block/nvme0n1/queue/atomic_write_unit_max_bytes
262144
# more /sys/block/nvme0n1/queue/virt_boundary_mask
0
Ming Lei Dec. 15, 2023, 2:27 a.m. UTC | #14
On Tue, Dec 12, 2023 at 11:08:36AM +0000, John Garry wrote:
> Currently an IO size is limited to the request_queue limits max_sectors.
> Limit the size for an atomic write to queue limit atomic_write_max_sectors
> value.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/blk-merge.c | 12 +++++++++++-
>  block/blk.h       |  3 +++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 0ccc251e22ff..8d4de9253fe9 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
>  {
>  	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
>  	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
> -	unsigned max_sectors = lim->max_sectors, start, end;
> +	unsigned max_sectors, start, end;
> +
> +	/*
> +	 * We ignore lim->max_sectors for atomic writes simply because
> +	 * it may less than bio->write_atomic_unit, which we cannot
> +	 * tolerate.
> +	 */
> +	if (bio->bi_opf & REQ_ATOMIC)
> +		max_sectors = lim->atomic_write_max_sectors;
> +	else
> +		max_sectors = lim->max_sectors;

I can understand the trouble for atomic writes from bio splitting, which
may simply split at the max_sectors boundary; however, this change is
still too fragile:

1) ->max_sectors may be set from userspace
- so this change simply overrides the userspace setting

2) otherwise ->max_sectors is the same as ->max_hw_sectors:

- then something must be wrong on the device side or driver side because
->write_atomic_unit conflicts with ->max_hw_sectors, which is supposed
to be figured out before the device is set up

3) a too-big max_sectors may break the driver or device, such as nvme-pci
aligning max_hw_sectors with the DMA optimized mapping size

And there might be more (better) choices:

1) make sure the atomic write limit is respected when userspace updates
->max_sectors

2) when the driver finds that the atomic write limits conflict with other
existing hardware limits, fail or resolve the conflict (such as by reducing
the atomic write unit) before the queue is started; with a single atomic
write limits update API, the conflict can be figured out earlier by the
block layer too.



thanks, 
Ming
John Garry Dec. 15, 2023, 1:55 p.m. UTC | #15
On 15/12/2023 02:27, Ming Lei wrote:
> On Tue, Dec 12, 2023 at 11:08:36AM +0000, John Garry wrote:
>> Currently an IO size is limited to the request_queue limits max_sectors.
>> Limit the size for an atomic write to queue limit atomic_write_max_sectors
>> value.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   block/blk-merge.c | 12 +++++++++++-
>>   block/blk.h       |  3 +++
>>   2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 0ccc251e22ff..8d4de9253fe9 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
>>   {
>>   	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
>>   	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
>> -	unsigned max_sectors = lim->max_sectors, start, end;
>> +	unsigned max_sectors, start, end;
>> +
>> +	/*
>> +	 * We ignore lim->max_sectors for atomic writes simply because
>> +	 * it may less than bio->write_atomic_unit, which we cannot
>> +	 * tolerate.
>> +	 */
>> +	if (bio->bi_opf & REQ_ATOMIC)
>> +		max_sectors = lim->atomic_write_max_sectors;
>> +	else
>> +		max_sectors = lim->max_sectors;

Please note that I mentioned this issue in the cover letter, so you can
see some discussion there (in case you missed it).

> 
> I can understand the trouble for write atomic from bio split, which
> may simply split in the max_sectors boundary, however this change is
> still too fragile:
> 
> 1) ->max_sectors may be set from userspace
> - so this change simply override userspace setting

yes

> 
> 2) otherwise ->max_sectors is same with ->max_hw_sectors:
> 
> - then something must be wrong in device side or driver side because
> ->write_atomic_unit conflicts with ->max_hw_sectors, which is supposed
> to be figured out before device is setup
> 

Right, so I think that it is proper to limit atomic_write_unit_max et al
to max_hw_sectors or whatever the DMA engine limits the device to. I can
make that change.
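
Roughly something like this (sketch only; the function name is made up,
and atomic_write_max_sectors is the queue limit added earlier in this
series):

#include <linux/blkdev.h>
#include <linux/minmax.h>

/* Sketch: clamp the atomic write limit to the hardware limit so that a
 * driver/device mismatch cannot leak through to bio submission. */
static void blk_clamp_atomic_write_limits(struct queue_limits *lim)
{
	lim->atomic_write_max_sectors =
		min(lim->atomic_write_max_sectors, lim->max_hw_sectors);
}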

> 3) too big max_sectors may break driver or device, such as nvme-pci
> aligns max_hw_sectors with DMA optimized mapping size
> 
> And there might be more(better) choices:
> 
> 1) make sure atomic write limit is respected when userspace updates
> ->max_sectors

My mind has been changed and I would say no and we can treat 
atomic_write_unit_max as special. Indeed, max_sectors and 
atomic_write_unit_max are complementary. If someone sets max_sectors to 
4KB and then tries an atomic write of 16KB then they don't know what 
they are doing.

My original idea was to dynamically limit atomic_write_unit_max et al to
max_sectors (so that max_sectors is always respected for atomic writes).

As an alternative, how about we keep the value of atomic_write_unit_max
static, but reject an atomic write if it exceeds max_sectors? It's not
too different from dynamically limiting atomic_write_unit_max. But as
mentioned, it may be asking for trouble....

> 
> 2) when driver finds that atomic write limits conflict with other
> existed hardware limits, fail or solve(such as reduce write atomic unit) the
> conflict before queue is started; With single write atomic limits update API,
> the conflict can be figured out earlier by block layer too.

I think that we can do this, but I am not sure what other queue limits
may conflict (apart from max_sectors / max_hw_sectors). There is max
segment size, but we just rely on a single PAGE per iovec to evaluate
atomic_write_unit_max, so it should not be an issue.

Thanks,
John
Keith Busch Dec. 18, 2023, 10:50 p.m. UTC | #16
On Thu, Dec 14, 2023 at 03:37:09PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> >>> Are there any patches yet for the change to always use SGLs for transfers
> >>> larger than a single PRP?
> >> No.
> 
> Here is the WIP version.  With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

This looks pretty good as-is!
 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8ebdfd623e0f78..e04faffd6551fe 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
>  		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
>  		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
>  	}
> -	blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
> +	if (q == ctrl->admin_q || ctrl->need_virt_boundary)
> +		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
>  	blk_queue_dma_alignment(q, 3);
>  	blk_queue_write_cache(q, vwc, vwc);
>  }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index e7411dac00f725..aa98794a3ec53d 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
>  struct nvme_ctrl {
>  	bool comp_seen;
>  	bool identified;
> +	bool need_virt_boundary;
>  	enum nvme_ctrl_state state;
>  	spinlock_t lock;
>  	struct mutex scan_lock;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 61af7ff1a9d6ba..a8d273b475cb40 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
>  static unsigned int sgl_threshold = SZ_32K;
>  module_param(sgl_threshold, uint, 0644);
>  MODULE_PARM_DESC(sgl_threshold,
> -		"Use SGLs when average request segment size is larger or equal to "
> -		"this size. Use 0 to disable SGLs.");
> +		"Use SGLs when > 0. Use 0 to disable SGLs.");
>  
>  #define NVME_PCI_MIN_QUEUE_SIZE 2
>  #define NVME_PCI_MAX_QUEUE_SIZE 4095
> @@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
>  	spin_unlock(&nvmeq->sq_lock);
>  }
>  
> -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
> -				     int nseg)
> -{
> -	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> -	unsigned int avg_seg_size;
> -
> -	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
> -
> -	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> -		return false;
> -	if (!nvmeq->qid)
> -		return false;
> -	if (!sgl_threshold || avg_seg_size < sgl_threshold)
> -		return false;
> -	return true;
> -}
> -
>  static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
>  {
>  	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
> @@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
>  static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		struct nvme_command *cmnd)
>  {
> +	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
> +			nvmeq->qid && sgl_threshold;
>  	blk_status_t ret = BLK_STS_RESOURCE;
>  	int rc;
>  
>  	if (blk_rq_nr_phys_segments(req) == 1) {
> -		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  		struct bio_vec bv = req_bvec(req);
>  
>  		if (!is_pci_p2pdma_page(bv.bv_page)) {
> @@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  				return nvme_setup_prp_simple(dev, req,
>  							     &cmnd->rw, &bv);
>  
> -			if (nvmeq->qid && sgl_threshold &&
> -			    nvme_ctrl_sgl_supported(&dev->ctrl))
> +			if (sgl_supported)
>  				return nvme_setup_sgl_simple(dev, req,
>  							     &cmnd->rw, &bv);
>  		}
> @@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		goto out_free_sg;
>  	}
>  
> -	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> +	if (sgl_supported)
>  		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
>  	else
>  		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
> @@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	result = nvme_init_ctrl_finish(&dev->ctrl, false);
>  	if (result)
>  		goto out_disable;
> +	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> +		dev->ctrl.need_virt_boundary = true;
>  
>  	nvme_dbbuf_dma_alloc(dev);
>  
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 81e2621169e5d3..416a9fbcccfc74 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>  	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
>  	if (error)
>  		goto out_quiesce_queue;
> +	ctrl->ctrl.need_virt_boundary = true;
>  
>  	return 0;
>
Darrick J. Wong Dec. 19, 2023, 5:14 a.m. UTC | #17
On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> On 13/12/2023 15:44, Christoph Hellwig wrote:
> > On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
> > > > > - How to make API extensible for when we have no HW support? In that case,
> > > > >     we would prob not have to follow rule of power-of-2 length et al.
> > > > >     As a possible solution, maybe we can say that atomic writes are
> > > > >     supported for the file via statx, but not set unit_min and max values,
> > > > >     and this means that writes need to be just FS block aligned there.
> > > > I don't think the power of two length is much of a problem to be
> > > > honest, and if we every want to lift it we can still do that easily
> > > > by adding a new flag or limit.
> > > ok, but it would be nice to have some idea on what that flag or limit
> > > change would be.
> > That would require a concrete use case.  The simples thing for a file
> > system that can or does log I/O it would simply be a flag waving all
> > the alignment and size requirements.
> 
> ok...
> 
> > 
> > > > I suspect we need an on-disk flag that forces allocations to be
> > > > aligned to the atomic write limit, in some ways similar how the
> > > > XFS rt flag works.  You'd need to set it on an empty file, and all
> > > > allocations after that are guaranteed to be properly aligned.
> > > Hmmm... so how is this different to the XFS forcealign feature?
> > Maybe not much.  But that's not what it is about - we need a common
> > API for this and not some XFS internal flag.  So if this is something
> > we could support in ext4 as well that would be a good step.  And for
> > btrfs you'd probably want to support something like it in nocow mode
> > if people care enough, or always support atomics and write out of
> > place.
> 
> The forcealign feature was a bit of case of knowing specifically what to do
> for XFS to enable atomic writes.
> 
> But some common method to configure any FS file for atomic writes (if
> supported) would be preferable, right?

/me stumbles back in with plenty of covidbrain to go around.

So ... Christoph, you're asking for a common API for
sysadmins/applications to signal to the filesystem that they want all
data allocations aligned to a given value, right?

e.g. if a nvme device advertises a capability for untorn writes between
$lbasize and 64k, then we need a way to set up each untorn-file with
some alignment between $lbasize and 64k?

or if cxl storage becomes not ung-dly expensive, then we'd want a way to
set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
advantage of PMD mmap mappings?

I guess we could define a fsxattr xflag for FORCEALIGN and declare that
the fsx_extsize field mandates mapping startoff/startblock alignment if
FORCEALIGN is set.

It just happens that since fsxattr comes from XFS then it'll be easy
enough for us to wire that up.  Maybe.
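
From userspace that would be something like the following (a sketch;
FS_XFLAG_FORCEALIGN is only a proposed flag so its value here is
illustrative, while the rest is the existing fsxattr interface):

#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef FS_XFLAG_FORCEALIGN
#define FS_XFLAG_FORCEALIGN	0x00020000	/* proposed; value illustrative */
#endif

/* Sketch: mark an (empty) file so that allocations are aligned to
 * 'alignment' bytes via the existing FS_IOC_FSSETXATTR interface. */
static int set_forcealign(int fd, unsigned int alignment)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_extsize = alignment;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN;

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}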

Concretely, xfs would have to have a way to force the
ag/rtgroup/rtextent size to align with the maximum anticipated required
alignment (e.g. 64k) of the device so that groups (and hence per-group
allocations) don't split the alignment.  xfs would also need to
constrain the per-inode alignment to some factor of the ag size / rt
extent size.

Writing files could use hw acceleration directly if all the factors are
set up correctly; or fall back to COW.  Assuming there's a way to
require that writeback pick up all the folios in a $forcealign chunk for
simultaneous writeout.  (I think this is a lingering bug in the xfs
rtreflink patchset.)  Also, IIRC John mentioned that there might be some
programs that would rather get an EIO than fall back to COW.

ext4 could maybe sort of do this by allowing forcealign alignments up to
the bigalloc size, if bigalloc is enabled.  Assuming bigalloc isn't DOA.
IIRC bigalloc only supports power of 2 factors.  It wouldn't have the
COW problem because it only supports overwrites.

As usual, I dunno about btrfs.

So does that sound like more or less what you're getting at, Christoph?
Or have I misread the entire thing?  (PS: there's been like 1200 fsdevel
messages since I got sick and I've only had sufficient brainpower to
limp the xfs 6.8 patches across the finish line...)

--D

> > 
> > > For XFS, I thought that your idea was to always CoW new extents for
> > > misaligned extents or writes which spanned multiple extents.
> > Well, that is useful for two things:
> > 
> >   - atomic writes on hardware that does not support it
> >   - atomic writes for bufferd I/O
> >   - supporting other sizes / alignments than the strict power of
> >     two above.
> 
> Sure
> 
> > 
> > > Right, so we should limit atomic write queue limits to max_hw_sectors. But
> > > people can still tweak max_sectors, and I am inclined to say that
> > > atomic_write_unit_max et al should be (dynamically) limited to max_sectors
> > > also.
> > Allowing people to tweak it seems to be asking for trouble.
> 
> I tend to agree....
> 
> Let's just limit at max_hw_sectors.
> 
> > 
> > > > have that silly limit.  For NVMe that would require SGL support
> > > > (and some driver changes I've been wanting to make for long where
> > > > we always use SGLs for transfers larger than a single PRP if supported)
> > > If we could avoid dealing with a virt boundary, then that would be nice.
> > > 
> > > Are there any patches yet for the change to always use SGLs for transfers
> > > larger than a single PRP?
> > No.
> 
> As below, if we are going to enforce alignment/size of iovecs elsewhere,
> then I don't need to worry about virt boundary.
> 
> > 
> > > On a related topic, I am not sure about how - or if we even should -
> > > enforce iovec PAGE-alignment or length; rather, the user could just be
> > > advised that iovecs must be PAGE-aligned and min PAGE length to achieve
> > > atomic_write_unit_max.
> > Anything that just advices the user an it not clear cut and results in
> > an error is data loss waiting to happen.  Even more so if it differs
> > from device to device.
> 
> ok, fine. I suppose that we better enforce it then.
> 
> Thanks,
> John
>
Christoph Hellwig Dec. 19, 2023, 5:21 a.m. UTC | #18
On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
> /me stumbles back in with plenty of covidbrain to go around.
> 
> So ... Christoph, you're asking for a common API for
> sysadmins/applications to signal to the filesystem that they want all
> data allocations aligned to a given value, right?
> 
> e.g. if a nvme device advertises a capability for untorn writes between
> $lbasize and 64k, then we need a way to set up each untorn-file with
> some alignment between $lbasize and 64k?
> 
> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
> advantage of PMD mmap mappings?

The most important point is to not mix these up.

If we want to use a file for atomic writes I should tell the fs about
it, and preferably in a way that does not require me to know about weird
internal implementation details of every file system.  I really just
want to use atomic writes.  Preferably I'd just start using them after
asking for the limits.  But that's obviously not going to work for
file systems that use the hardware offload and don't otherwise align
to the limit (which would suck for very small files anyway :))

So as a compromise I tell the file system before writing or otherwise
adding any data [1] to file that I want to use atomic writes so that
the fs can take the right decisions.

[1] reflinking data into a such marked file will be ... interesting.
John Garry Dec. 19, 2023, 12:41 p.m. UTC | #19
On 19/12/2023 05:21, Christoph Hellwig wrote:
> On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
>> /me stumbles back in with plenty of covidbrain to go around.
>>
>> So ... Christoph, you're asking for a common API for
>> sysadmins/applications to signal to the filesystem that they want all
>> data allocations aligned to a given value, right?
>>
>> e.g. if a nvme device advertises a capability for untorn writes between
>> $lbasize and 64k, then we need a way to set up each untorn-file with
>> some alignment between $lbasize and 64k?
>>
>> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
>> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
>> advantage of PMD mmap mappings?
> 
> The most important point is to not mix these up.
> 
> If we want to use a file for atomic writes I should tell the fs about
> it, and preferably in a way that does not require me to know about weird
> internal implementation details of every file system.  I really just
> want to use atomic writes.  Preferably I'd just start using them after
> asking for the limits.  But that's obviously not going to work for
> file systems that use the hardware offload and don't otherwise align
> to the limit (which would suck for very small files anyway :))
> 
> So as a compromise I tell the file system before writing or otherwise
> adding any data [1] to file that I want to use atomic writes so that
> the fs can take the right decisions.
> 
> [1] reflinking data into a such marked file will be ... interesting.
> 

How about something based on fcntl, like below? We will probably also
require some per-FS flag for enabling atomic writes without HW support.
That flag might also be useful for XFS for differentiating forcealign for
atomic writes from plain forcealign.

---8<----

diff --git a/fs/fcntl.c b/fs/fcntl.c
index c80a6acad742..d4a50655565a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -313,6 +313,15 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
  	}
  }

+static long fcntl_atomic_write(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	if (file->f_op->enable_atomic_writes)
+		return file->f_op->enable_atomic_writes(file, arg);
+
+	return -EOPNOTSUPP;
+}
+
  static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
  		struct file *filp)
  {
@@ -419,6 +428,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
  	case F_SET_RW_HINT:
  		err = fcntl_rw_hint(filp, cmd, arg);
  		break;
+	case F_SET_ATOMIC_WRITE_SIZE:
+		err = fcntl_atomic_write(filp, cmd, arg);
+		break;
  	default:
  		break;
  	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..dba206cbe1ab 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1478,6 +1478,46 @@ xfs_file_mmap(
  	return 0;
  }

+STATIC int
+xfs_enable_atomic_writes(
+	struct file	*file,
+	unsigned int unit_max)
+{
+	struct xfs_inode *ip = XFS_I(file_inode(file));
+	struct mnt_idmap *idmap = file_mnt_idmap(file);
+	struct dentry *dentry = file->f_path.dentry;
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct block_device	*bdev = target->bt_bdev;
+	struct request_queue	*q = bdev->bd_queue;
+	struct inode		*inode = &ip->i_vnode;
+
+	if (queue_atomic_write_unit_max_bytes(q) == 0) {
+		if (unit_max != 0) {
+			/* We expect unbounded CoW size if no HW support */
+			return -ENOTBLK;
+		}
+		/* Do something for CoW support ... */
+
+		return 0;
+	}
+
+	if (!unit_max)
+		return -EINVAL;
+
+	/* bodge alert */
+	if (inode->i_op->fileattr_set) {
+		struct fileattr fa = {
+			.fsx_extsize = unit_max,
+			.fsx_xflags = FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN,
+			.fsx_valid = 1,
+		};
+
+		return inode->i_op->fileattr_set(idmap, dentry, &fa);
+	}
+
+	return -EOPNOTSUPP;
+}
+
  const struct file_operations xfs_file_operations = {
  	.llseek		= xfs_file_llseek,
  	.read_iter	= xfs_file_read_iter,
@@ -1498,6 +1538,7 @@ const struct file_operations xfs_file_operations = {
  	.fallocate	= xfs_file_fallocate,
  	.fadvise	= xfs_file_fadvise,
  	.remap_file_range = xfs_file_remap_range,
+	.enable_atomic_writes = xfs_enable_atomic_writes,
  };

  const struct file_operations xfs_dir_file_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..a715f98057b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1956,6 +1956,7 @@ struct file_operations {
  	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
  	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
  				unsigned int poll_flags);
+	int (*enable_atomic_writes)(struct file *, unsigned int unit_max);
  } __randomize_layout;

  /* Wrap a directory iterator that needs exclusive inode access */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6c80f96049bd..69780c49622e 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -56,6 +56,8 @@
  #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
  #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)

+#define F_SET_ATOMIC_WRITE_SIZE  (F_LINUX_SPECIFIC_BASE + 15)
+
  /*
   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
   * used to clear any hints previously set.
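
For reference, userspace usage of the proposed opcode would be roughly
(a sketch; F_SET_ATOMIC_WRITE_SIZE only exists in the patch above):

#include <fcntl.h>

#ifndef F_SET_ATOMIC_WRITE_SIZE
#define F_SET_ATOMIC_WRITE_SIZE	(1024 + 15)	/* F_LINUX_SPECIFIC_BASE + 15 */
#endif

/* Sketch: ask the filesystem to prepare an (ideally empty) file for
 * atomic writes of up to 'unit_max' bytes. */
static int enable_atomic_writes(int fd, unsigned int unit_max)
{
	return fcntl(fd, F_SET_ATOMIC_WRITE_SIZE, (unsigned long)unit_max);
}
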
Christoph Hellwig Dec. 19, 2023, 3:17 p.m. UTC | #20
On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> How about something based on fcntl, like below? We will prob also require 
> some per-FS flag for enabling atomic writes without HW support. That flag 
> might be also useful for XFS for differentiating forcealign for atomic 
> writes with just forcealign.

I would have just exposed it through a user visible flag instead of
adding yet another ioctl/fcntl opcode and yet another method.

And yes, for anything that doesn't always support atomic writes it would
need to be persisted.
John Garry Dec. 19, 2023, 4:53 p.m. UTC | #21
On 19/12/2023 15:17, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>> How about something based on fcntl, like below? We will prob also require
>> some per-FS flag for enabling atomic writes without HW support. That flag
>> might be also useful for XFS for differentiating forcealign for atomic
>> writes with just forcealign.
> I would have just exposed it through a user visible flag instead of
> adding yet another ioctl/fcntl opcode and yet another method.
> 

Any specific type of flag?

I would suggest a file attribute which we can set via chattr, but that 
is still using an ioctl and would require a new inode flag; but at least 
there is standard userspace support.

> And yes, for anything that doesn't always support atomic writes it would
> need to be persisted.
Christoph Hellwig Dec. 21, 2023, 6:50 a.m. UTC | #22
On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> On 19/12/2023 15:17, Christoph Hellwig wrote:
>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>> How about something based on fcntl, like below? We will prob also require
>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>> might be also useful for XFS for differentiating forcealign for atomic
>>> writes with just forcealign.
>> I would have just exposed it through a user visible flag instead of
>> adding yet another ioctl/fcntl opcode and yet another method.
>>
>
> Any specific type of flag?
>
> I would suggest a file attribute which we can set via chattr, but that is 
> still using an ioctl and would require a new inode flag; but at least there 
> is standard userspace support.

I'd be fine with that, but we're kinda running out of flags there.
That's why I suggested the FS_XFLAG_ instead, which basically works
the same.
John Garry Dec. 21, 2023, 12:48 p.m. UTC | #23
On 21/12/2023 12:19, Christoph Hellwig wrote:
>> So then we could have:
>> - ubuf / iovecs need to be PAGE-aligned
>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>> PAGE_SIZE.
>>
>> I'd rather something simpler. Maybe it's ok.
> If we decided to not support atomic writes on anything setting a virt
> boundary we don't have to care about the alignment of each vector,

ok, I think that alignment is not so important, but we still need to 
consider a minimum length per iovec, such that we will always be able to 
fit a write of length atomic_write_unit_max in a bio.

> and IMHO we should do that as everything else would be a life in
> constant pain.  If we really have a use case for atomic writes on
> consumer NVMe devices we'll just have to limit it to a single iovec.

I'd be more inclined to say that SGL support should be compulsory, but I 
don't know whether that would limit atomic write support to an 
unsatisfactorily small market segment.
John Garry Dec. 21, 2023, 1:18 p.m. UTC | #24
On 21/12/2023 12:57, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>>> - ubuf / iovecs need to be PAGE-aligned
>>>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>>>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>>>> PAGE_SIZE.
>>>>
>>>> I'd rather something simpler. Maybe it's ok.
>>> If we decided to not support atomic writes on anything setting a virt
>>> boundary we don't have to care about the alignment of each vector,
>>
>> ok, I think that alignment is not so important, but we still need to
>> consider a minimum length per iovec, such that we will always be able to
>> fit a write of length atomic_write_unit_max in a bio.
> 
> I don't think you mean a minimum length per iovec for that, but a maximum
> number of iovecs instead.

That would make sense, but I was thinking that if the request queue has 
a limit on segments then we would need to specify an iovec min length.

> For SGL-capable devices that would be
> BIO_MAX_VECS, otherwise 1.

ok, but we would need to advertise that or whatever segment limit. A 
statx field just for that seems a bit inefficient in terms of space.
John Garry Dec. 21, 2023, 1:56 p.m. UTC | #25
On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>>
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> 
> I'd rather not hard code BIO_MAX_VECS in the ABI,

For sure

> which suggests we
> want to export it as a field.

ok

> Network file systems also might have
> their own limits for one reason or another.

Understood
John Garry Jan. 9, 2024, 9:55 a.m. UTC | #26
On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might be also useful for XFS for differentiating forcealign for atomic
>>>> writes with just forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
> I'd be fine with that, but we're kinda running out of flags there.
> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.

Hi Christoph,

Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..9ef15fced20c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -118,7 +118,8 @@ struct fsxattr {
        __u32           fsx_nextents;   /* nextents field value (get)   */
        __u32           fsx_projid;     /* project identifier (get/set) */
        __u32           fsx_cowextsize; /* CoW extsize field value (get/set)*/
-       unsigned char   fsx_pad[8];
+       __u32           fsx_atomicwrites_size; /* unit max */
+       unsigned char   fsx_pad[4];
};

/*
@@ -140,6 +141,7 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator */
#define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size allocator hint */
+#define FS_XFLAG_ATOMICWRITES  0x00020000
#define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */

/* the read-only stuff doesn't really belong here, but any other place is

Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE 
being set.

So a user can issue:

 >xfs_io -c "atomic-writes 64K" mnt/file
 >xfs_io -c "atomic-writes" mnt/file
[65536] mnt/file

and then:

 >xfs_io -c "lsattr" mnt/file
------------------W mnt/file

(W is the new flag for atomic writes, obviously)

The user will still have to issue statx to get the actual atomic write 
limit for a file, as 'xfs_io -c "atomic-writes"' does not take into 
account any HW/linux block layer atomic write limits.
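
As a rough sketch of that statx step (assuming headers carrying this
series' STATX_WRITE_ATOMIC and stx_atomic_write_unit_{min,max} additions):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct statx stx;
	int fd = open("mnt/file", O_RDWR | O_DIRECT);

	if (fd < 0)
		return 1;

	if (statx(fd, "", AT_EMPTY_PATH,
		  STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stx) < 0)
		return 1;

	/* these reflect the HW/block layer limits, not just the FS_XFLAG_ */
	printf("atomic write unit min %u max %u\n",
	       stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);
	return 0;
}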

FS_XFLAG_ATOMICWRITES will force XFS extent size and alignment to 
fsx_atomicwrites_size when we have HW support, so it is effectively the 
same as forcealign.  With no HW support, we would still specify a size. In 
the case of a possible XFS CoW solution for no atomic write HW support, I 
suppose that there would be no size limit in reality, so specifying the 
size would just be for userspace experience consistency.

Is this the sort of userspace API which you would like to see?

John
Christoph Hellwig Jan. 9, 2024, 4:02 p.m. UTC | #27
On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> So a user can issue:
>
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Let me try to decipher that:

 - the first call sets a 64k fsx_atomicwrites_size
 - the second call queries fsx_atomicwrites_size?

> The user will still have to issue statx to get the actual atomic write 
> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account 
> any HW/linux block layer atomic write limits.

So will the set side never fail?

> Is this the sort of userspace API which you would like to see?

What I had in mind (and that doesn't mean it's right..) was that
the user just sets a binary flag, and the fs reports the best it
could.  But there might be reasons to do it differently.
John Garry Jan. 9, 2024, 4:52 p.m. UTC | #28
On 09/01/2024 16:02, Christoph Hellwig wrote:
> On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Let me try to decipher that:
> 
>   - the first call sets a 64k fsx_atomicwrites_size

It should also set FS_XFLAG_ATOMICWRITES for the file. So this step does 
everything to enable atomic writes for that file.

>   - the second call queries fsx_atomicwrites_size?

Right, I'm just demo'ing how it would look

> 
>> The user will still have to issue statx to get the actual atomic write
>> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account
>> any HW/linux block layer atomic write limits.
> So will the set side never fail?

It could fail.

Examples of when it could fail could include:

a. If the user gave a bad size value. The size should be a power-of-2 and 
also divisible into the AG size and compatible with any stripe alignment 
(see the sketch after this list).

b. If the file already had flags set which are incompatible with or not 
supported for atomic writes.

c. the file already has data written. I guess that's obvious.
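
As a rough sketch of the kind of size validation meant in a. above - the
helper names and exact rules here are illustrative assumptions, not the
real XFS checks:

#include <stdbool.h>

/* the kernel would use is_power_of_2(); open-coded here for the sketch */
static bool is_pow2(unsigned long v)
{
	return v && !(v & (v - 1));
}

/*
 * Illustrative only: the size must be a power-of-2, divide evenly into
 * the AG size, and respect any stripe alignment.
 */
static bool atomic_write_size_ok(unsigned long size, unsigned long agsize,
				 unsigned long stripe_align)
{
	if (!is_pow2(size))
		return false;
	if (agsize % size)
		return false;
	if (stripe_align && size % stripe_align)
		return false;
	return true;
}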

> 
>> Is this the sort of userspace API which you would like to see?
> What I had in mind (and that doesn't mean it's right..) was that
> the user just sets a binary flag, and the fs reports the best it
> could.  But there might be reasons to do it differently.

That is what I am trying to do, but I also want the user to be able to 
specify the atomic write unit max size which they actually want. I'd 
rather not use the atomic write unit max from the device for that, as 
that could be huge. However, re-reading a. above makes me think that the 
kernel should have more input on this, but we still need some input on 
the max from the user...

Thanks,
John
Dave Chinner Jan. 9, 2024, 11:04 p.m. UTC | #29
On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> On 21/12/2023 06:50, Christoph Hellwig wrote:
> > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> > > On 19/12/2023 15:17, Christoph Hellwig wrote:
> > > > On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> > > > > How about something based on fcntl, like below? We will prob also require
> > > > > some per-FS flag for enabling atomic writes without HW support. That flag
> > > > > might be also useful for XFS for differentiating forcealign for atomic
> > > > > writes with just forcealign.
> > > > I would have just exposed it through a user visible flag instead of
> > > > adding yet another ioctl/fcntl opcode and yet another method.
> > > > 
> > > Any specific type of flag?
> > > 
> > > I would suggest a file attribute which we can set via chattr, but that is
> > > still using an ioctl and would require a new inode flag; but at least there
> > > is standard userspace support.
> > I'd be fine with that, but we're kinda running out of flags there.
> > That's why I suggested the FS_XFLAG_ instead, which basically works
> > the same.
> 
> Hi Christoph,
> 
> Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:
> 
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index da43810b7485..9ef15fced20c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -118,7 +118,8 @@ struct fsxattr {
>        __u32           fsx_nextents;   /* nextents field value (get)   */
>        __u32           fsx_projid;     /* project identifier (get/set) */
>        __u32           fsx_cowextsize; /* CoW extsize field value (get/set)*/
> -       unsigned char   fsx_pad[8];
> +       __u32           fsx_atomicwrites_size; /* unit max */
> +       unsigned char   fsx_pad[4];
> };
> 
> /*
> @@ -140,6 +141,7 @@ struct fsxattr {
> #define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator */
> #define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
> #define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size allocator hint */
> +#define FS_XFLAG_ATOMICWRITES  0x00020000
> #define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */
> 
> /* the read-only stuff doesn't really belong here, but any other place is
> 
> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
> set.
> 
> So a user can issue:
> 
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Where are you going to store this value in the inode?  It requires a
new field in the inode and so is a change of on-disk format, right?

As it is, I really don't see this as a better solution than the
original generic "force align" flag that simply makes the extent
size hint alignment a hard physical alignment requirement rather
than just a hint. This has multiple uses (DAX PMD alignment is
another), so I just don't see why a single-purpose, application-specific
API that implements a hard physical alignment is desirable.

Indeed, the whole reason that extent size hints are so versatile is
that they implement a generic allocation alignment/size function
that can be used for anything your imagination extends to. If they
were implemented as a "only allow RAID stripe aligned/sized
allocation" for the original use case then that functionality would
have been far less useful than it has proven to be over the past
couple of decades.

Hence history teaches us that we should be designing the API around
the generic filesystem function required (hard alignment of physical
extent allocation), not the specific use case that requires that
functionality.

-Dave.
John Garry Jan. 10, 2024, 8:55 a.m. UTC | #30
On 09/01/2024 23:04, Dave Chinner wrote:
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -118,7 +118,8 @@ struct fsxattr {
>>         __u32           fsx_nextents;   /* nextents field value (get)   */
>>         __u32           fsx_projid;     /* project identifier (get/set) */
>>         __u32           fsx_cowextsize; /* CoW extsize field value (get/set)*/
>> -       unsigned char   fsx_pad[8];
>> +       __u32           fsx_atomicwrites_size; /* unit max */
>> +       unsigned char   fsx_pad[4];
>> };
>>
>> /*
>> @@ -140,6 +141,7 @@ struct fsxattr {
>> #define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator */
>> #define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
>> #define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES  0x00020000
>> #define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */
>>
>> /* the read-only stuff doesn't really belong here, but any other place is
>>
>> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
>> set.
>>
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Where are you going to store this value in the inode?  It requires a
> new field in the inode and so is a change of on-disk format, right?

It would require an on-disk format change, unless we can find an 
alternative way to store the value, like:
a. re-use pre-existing extsize or even cowextsize fields and 'xfs_io -c 
"atomic-writes $SIZE"' would update those fields and 
FS_XFLAG_ATOMICWRITES would be incompatible with FS_XFLAG_COWEXTSIZE or 
FS_XFLAG_EXTSIZE
b. require FS_XFLAG_EXTSIZE and extsize be also set to enable atomic 
writes, and extsize is used for atomic write unit max

I'm trying to think of ways to avoid requiring a value, but I don't see 
any good options; for example:
- make atomic write unit max some compile-time option
- require mkfs stripe alignment/width to be set and use that as the basis 
for atomic write unit max

We could just use the atomic write unit max which the HW provides, but 
that could be 1MB or more, and that will hardly give efficient data usage 
for small files. But maybe we don't care about that if we expect this 
feature to only be used on DB files, which can be huge anyway. However I 
still have concerns: we require that value to be fixed, but a disk 
firmware update could increase it, and that could leave pre-existing 
extents mis-aligned with respect to the new value.

> 
> As it is, I really don't see this as a better solution than the
> original generic "force align" flag that simply makes the extent
> size hint alignment a hard physical alignment requirement rather
> than just a hint. This has multiple uses (DAX PMD alignment is
> another), so I just don't see why something that has a single,
> application specific API that implements a hard physical alignment
> is desirable.

I would still hope that we will support forcealign separately for those 
purposes.

> 
> Indeed, the whole reason that extent size hints are so versatile is
> that they implement a generic allocation alignment/size function
> that can be used for anything your imagination extends to. If they
> were implemented as a "only allow RAID stripe aligned/sized
> allocation" for the original use case then that functionality would
> have been far less useful than it has proven to be over the past
> couple of decades.
> 
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I understand your concern. However I am not sure that forcealign even 
gives us everything we want to enable atomic writes. There is an 
issue where we were required to pre-zero a file prior to issuing atomic 
writes to ensure extents are suitably sized, so FS_XFLAG_ATOMICWRITES 
would make the FS do what is required to avoid that pre-zeroing (though 
that pre-zeroing requirement does sound like a forcealign issue...)

Furthermore, there was some desire to support atomic writes on block 
devices with no HW support by using a CoW-based solution, and forcealign 
would not be relevant there.

Thanks,
John
Christoph Hellwig Jan. 10, 2024, 9:19 a.m. UTC | #31
On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I disagree.  The alignment requirement is an artefact of how you
implement atomic writes.  As the fs user I care that I can do atomic
writes on a file and need to query how big the writes can be and
what alignment is required.

The forcealign feature is a sensible fs side implementation of that
if using hardware based atomic writes with alignment requirements,
but it is a really lousy userspace API.

So with John's API proposal for XFS with hardware alignment based atomic
writes we could still use force align.

Requesting atomic writes for an inode will set the forcealign flag
and the extent size hint, and after that it'll report atomic write
capabilities.  Roughly the same implementation, but not an API
tied to an implementation detail.
Darrick J. Wong Jan. 11, 2024, 1:40 a.m. UTC | #32
On Wed, Jan 10, 2024 at 10:19:29AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> > Hence history teaches us that we should be designing the API around
> > the generic filesystem function required (hard alignment of physical
> > extent allocation), not the specific use case that requires that
> > functionality.
> 
> I disagree.  The alignment requirement is an artefact of how you
> implement atomic writes.  As the fs user I care that I can do atomic
> writes on a file and need to query how big the writes can be and
> what alignment is required.
> 
> The forcealign feature is a sensible fs side implementation of that
> if using hardware based atomic writes with alignment requirements,
> but it is a really lousy userspace API.
> 
> So with John's API proposal for XFS with hardware alignment based atomic
> writes we could still use force align.
> 
> Requesting atomic writes for an inode will set the forcealign flag
> and the extent size hint, and after that it'll report atomic write
> capabilities.  Roughly the same implementation, but not an API
> tied to an implementation detail.

Sounds good to me!  So to summarize, userspace programs would have to do 
approximately this:

struct statx statx;
struct fsxattr fsxattr;
int fd = open('/foofile', O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);

fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */

ioctl(fd, FS_IOC_SETXATTR, &fsxattr);

statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);

if (statx.stx_atomic_write_unit_max >= 16384) {
	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
	printf("HAPPY DANCE\n");
}

(Assume we bail out on errors.)

--D
Christoph Hellwig Jan. 11, 2024, 5:02 a.m. UTC | #33
On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open('/foofile', O_RDWR | O_DIRECT);
> 
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> 
> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> 
> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
> 
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> 
> if (statx.stx_atomic_write_unit_max >= 16384) {
> 	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> 	printf("HAPPY DANCE\n");
> }

I think this still needs a check if the fs needs alignment for
atomic writes at all. i.e.

struct statx statx;
struct fsxattr fsxattr;
int fd = open('/foofile', O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
if (statx.stx_atomic_write_unit_max < 16384) {
	bailout();
}

fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
if (statx.stx_atomic_write_alignment) {
	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
}
if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 1) {
	bailout();
}

pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
printf("HAPPY DANCE\n");
John Garry Jan. 11, 2024, 9:55 a.m. UTC | #34
On 11/01/2024 05:02, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open('/foofile', O_RDWR | O_DIRECT);

I'm assuming O_CREAT also.

>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>
>> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
>> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
>>
>> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
>>
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>
>> if (statx.stx_atomic_write_unit_max >= 16384) {
>> 	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
>> 	printf("HAPPY DANCE\n");
>> }
> 
> I think this still needs a check if the fs needs alignment for
> atomic writes at all. i.e.
> 
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open('/foofile', O_RDWR | O_DIRECT);
> 
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> if (statx.stx_atomic_write_unit_max < 16384) {
> 	bailout();
> }

How could this value be >= 16384 initially? Would it be from 
pre-configured FS alignment, like XFS RT extsize? Or is this from some 
special CoW-based atomic write support? Or FS block size of 16384?

Incidentally, for consistency, only setting FS_XFLAG_WRITE_ATOMIC will 
lead to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC 
is set, would it make sense to have statx return 0 for 
STATX_WRITE_ATOMIC? Otherwise the user may be misled into thinking that it 
is ok to issue an atomic write (when it isn’t).

Thanks,
John

> 
> fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
> if (statx.stx_atomic_write_alignment) {
> 	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
> 	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> }
> if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 1) {
> 	bailout();
> }
> 
> pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> printf("HAPPY DANCE\n");
>
Christoph Hellwig Jan. 11, 2024, 2:45 p.m. UTC | #35
On Thu, Jan 11, 2024 at 09:55:36AM +0000, John Garry wrote:
> On 11/01/2024 05:02, Christoph Hellwig wrote:
>> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>
> I'm assuming O_CREAT also.

Yes.

>> I think this still needs a check if the fs needs alignment for
>> atomic writes at all. i.e.
>>
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>> if (statx.stx_atomic_write_unit_max < 16384) {
>> 	bailout();
>> }
>
> How could this value be >= 16384 initially? Would it be from pre-configured 
> FS alignment, like XFS RT extsize? Or is this from some special CoW-based 
> atomic write support? Or FS block size of 16384?

Sorry, this check should not be here at all, we should only check it
later.

> Incidentally, for consistency, only setting FS_XFLAG_WRITE_ATOMIC will lead 
> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set, 
> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC?

True.  We might need to report the limits even without that, though.
John Garry Jan. 11, 2024, 4:11 p.m. UTC | #36
> 
>>> I think this still needs a check if the fs needs alignment for
>>> atomic writes at all. i.e.
>>>
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>>>
>>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>> if (statx.stx_atomic_write_unit_max < 16384) {
>>> 	bailout();
>>> }
>>
>> How could this value be >= 16384 initially? Would it be from pre-configured
>> FS alignment, like XFS RT extsize? Or is this from some special CoW-based
>> atomic write support? Or FS block size of 16384?
> 
> Sorry, this check should not be here at all, we should only check it
> later.
> 
>> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead
>> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set
>> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC.
> 
> True.  We might need to report the limits even without that, though.

Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not 
set (and it is required)? The problem is that ioctl reports -EINVAL for 
any such errors, so hard for the user to know the issue...

Thanks,
John
Christoph Hellwig Jan. 11, 2024, 4:15 p.m. UTC | #37
On Thu, Jan 11, 2024 at 04:11:38PM +0000, John Garry wrote:
> Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not set 
> (and it is required)? The problem is that ioctl reports -EINVAL for any 
> such errors, so hard for the user to know the issue...

Sure.  Pick a good unique error code, though.
John Garry Jan. 16, 2024, 11:35 a.m. UTC | #38
On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> I'd rather not hard code BIO_MAX_VECS in the ABI, which suggests we
> want to export it as a field.  Network file systems also might have
> their own limits for one reason or another.

Hi Christoph,

I have been looking at this issue again and I am not sure if telling the 
user the max number of segments allowed is the best option. I’m worried 
that the resultant atomic write unit max will be too small.

The background again is that we want to tell the user what the maximum 
atomic write unit size is, such that we can always guarantee to fit the 
write in a single bio. And there would be no iovec length or alignment 
rules.

The max segments value advertised would be min(queue max segments, 
BIO_MAX_VECS), so it would be 256 when the request queue is not limiting.

The worst case (most inefficient) iovec layout which the user could 
provide would be something like .iov_base = 0x...0E00 and .iov_len = 
0x400, which would mean 2x pages and 2x DMA sg elems required for each 
1024B-length iovec. I am assuming that we will still use the direct IO 
rule of LBS length and alignment.

As such, we then need to set atomic write unit max = min(queue max 
segments, BIO_MAX_VECS) * LBS. That would mean an atomic write unit max of 
256 * 512 = 128K (for 512B LBS). For a DMA controller with max segments of 
64, for example, we would have 32K. These seem too low.

Alternatively, I'm thinking that we should just limit to 1x iovec always, 
and then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) 
- 1) * PAGE_SIZE [ignoring first/last iovec contents]. It also makes 
support for non-enterprise NVMe drives more straightforward. If someone 
wants, they can introduce support for multiple iovecs later, but it would 
probably require some more iovec length/alignment rules.
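
To make the arithmetic above concrete, a small standalone sketch
(BIO_MAX_VECS hard-coded to its current in-kernel value of 256; the helper
names are made up for illustration):

#include <stdio.h>

#define BIO_MAX_VECS 256u	/* current kernel value, hard-coded here */

/* worst case: every LBS-sized, badly aligned iovec costs a full segment */
static unsigned int unit_max_multi_iovec(unsigned int queue_max_segments,
					 unsigned int lbs)
{
	unsigned int segs = queue_max_segments < BIO_MAX_VECS ?
			    queue_max_segments : BIO_MAX_VECS;

	return segs * lbs;
}

/* single-iovec alternative: all but the first/last vector map full pages */
static unsigned int unit_max_single_iovec(unsigned int queue_max_segments,
					  unsigned int page_size)
{
	unsigned int segs = queue_max_segments < BIO_MAX_VECS ?
			    queue_max_segments : BIO_MAX_VECS;

	return (segs - 1) * page_size;
}

int main(void)
{
	printf("multi-iovec, 256 segs, 512B LBS:  %uK\n",
	       unit_max_multi_iovec(256, 512) / 1024);		/* 128K */
	printf("multi-iovec, 64 segs, 512B LBS:   %uK\n",
	       unit_max_multi_iovec(64, 512) / 1024);		/* 32K */
	printf("single-iovec, 256 segs, 4K pages: %uK\n",
	       unit_max_single_iovec(256, 4096) / 1024);	/* 1020K */
	return 0;
}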

Please let me know your thoughts.

Thanks,
John
John Garry Jan. 17, 2024, 4:16 p.m. UTC | #39
On 17/01/2024 15:02, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
>> As such, we then need to set atomic write unit max = min(queue max
>> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 *
>> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for
>> example, then we would have 32K. These seem too low.
> 
> I don't see how this would work if we support multiple sectors.
> 
>>
>> Alternative I'm thinking that we should just limit to 1x iovec always, and
>> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) *
>> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for
>> non-enterprise NVMe drives more straightforward. If someone wants, they can
>> introduce support for multi-iovec later, but it would prob require some
>> more iovec length/alignment rules.
> 
> Supporting just a single iovec initially is fine with me, as extending
> that is pretty easy.  Just talk to your potential users that they can
> live with it.

Yeah, any porting I know about has just been using aio with 
IO_CMD_PWRITE, so that would be ok

> 
> I'd probably still advertise the limits even if it currently always is 1.
> 

I suppose that we don't need any special rule until we support > 1, as 
we cannot break anyone already using > 1 :)

Thanks,
John
John Garry Jan. 22, 2024, 8:29 a.m. UTC | #40
On 12/12/2023 11:08, John Garry wrote:
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -301,9 +301,12 @@ typedef int __bitwise __kernel_rwf_t;
>   /* per-IO O_APPEND */
>   #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>   
> +/* Atomic Write */
> +#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000020)
> +

VFS maintainers: due to 
https://lore.kernel.org/linux-fsdevel/20240119191053.GT22081@brightrain.aerifal.cx/T/#t, 
I now plan to change this value to 0x40.

Thanks,
John