[v5,3/5] fcntl: add F_{SET/GET}_RW_HINT_EX

Message ID 20240910150200.6589-4-joshi.k@samsung.com
State New
Series data placement hints and FDP

Commit Message

Kanchan Joshi Sept. 10, 2024, 3:01 p.m. UTC
This is similar to existing F_{SET/GET}_RW_HINT but more
generic/extensible.

F_SET/GET_RW_HINT_EX take a pointer to a struct rw_hint_ex as argument:

struct rw_hint_ex {
        __u8    type;
        __u8    pad[7];
        __u64   val;
};

With F_SET_RW_HINT_EX, the user passes the hint type and its value.
Hint type can be either lifetime hint (TYPE_RW_LIFETIME_HINT) or
placement hint (TYPE_RW_PLACEMENT_HINT). The interface allows more hint
types to be added in the future.

Valid values for lifetime hints are the same as those supported by the
existing fcntl(F_SET_RW_HINT).
Valid values for placement hints are 0 to 126, inclusive.

The inode retains either the lifetime hint or the placement hint, whichever
is set later. The set hint type and its value can be queried by
F_GET_RW_HINT_EX.
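
As an illustration only, a userspace caller of the proposed interface might
look like the sketch below. It assumes this patch is applied; the command
numbers and the structure are copied from the uapi changes in this patch
(F_LINUX_SPECIFIC_BASE is 1024), and the helper names set_placement_hint()
and get_hint() are hypothetical. On a kernel without the patch the calls are
expected to fail.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values copied from this patch; they are not in released uapi headers. */
#ifndef F_SET_RW_HINT_EX
#define F_GET_RW_HINT_EX	(1024 + 15)	/* F_LINUX_SPECIFIC_BASE + 15 */
#define F_SET_RW_HINT_EX	(1024 + 16)	/* F_LINUX_SPECIFIC_BASE + 16 */

enum rw_hint_type {
	TYPE_RW_LIFETIME_HINT = 1,
	TYPE_RW_PLACEMENT_HINT
};

struct rw_hint_ex {
	uint8_t		type;
	uint8_t		pad[7];
	uint64_t	val;
};
#endif

/* The kernel copies sizeof(struct rw_hint_ex) bytes, so check the layout. */
static_assert(sizeof(struct rw_hint_ex) == 16, "unexpected rw_hint_ex size");
static_assert(offsetof(struct rw_hint_ex, val) == 8, "unexpected val offset");

/* Hypothetical helper: set a placement hint; returns 0, or -1 with errno set. */
static int set_placement_hint(int fd, uint64_t val)
{
	struct rw_hint_ex rwh;

	memset(&rwh, 0, sizeof(rwh));	/* non-zero pad[] is rejected with -EINVAL */
	rwh.type = TYPE_RW_PLACEMENT_HINT;
	rwh.val = val;
	return fcntl(fd, F_SET_RW_HINT_EX, &rwh);
}

/* Hypothetical helper: read back whichever hint the inode currently holds. */
static int get_hint(int fd, struct rw_hint_ex *rwh)
{
	memset(rwh, 0, sizeof(*rwh));
	return fcntl(fd, F_GET_RW_HINT_EX, rwh);
}
```

After a successful get_hint(), rwh->type tells the caller whether rwh->val is
a lifetime or a placement hint, since the inode keeps only one of the two.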

The i_write_hint field of the inode is a 1-byte field. Use the most
significant bit as the hint type. This bit is set for placement hint.
For lifetime hint, this bit remains zero.
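
The one-byte encoding described above can be sketched in plain C as follows.
This is not code from the patch: the HINT_* names and the pack/is helpers are
hypothetical stand-ins that mirror the patch's WRITE_HINT_* macros, written
with userspace types so the arithmetic can be checked outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HINT_TYPE_BIT		0x80u			/* BIT(7): set for placement hints */
#define HINT_VAL_MASK		(HINT_TYPE_BIT - 1)	/* 0x7f: the low seven bits */
#define HINT_MAX_PLACEMENT	(HINT_VAL_MASK - 1)	/* 126, as defined in the patch */

static_assert(HINT_MAX_PLACEMENT == 126, "placement range is 0..126");

/* Pack a placement hint value (0..126) into the one-byte representation. */
static uint8_t pack_placement_hint(uint8_t val)
{
	return (uint8_t)(HINT_TYPE_BIT | (val & HINT_VAL_MASK));
}

/* Lifetime hints are stored unchanged; bit 7 stays clear. */
static uint8_t pack_lifetime_hint(uint8_t val)
{
	return (uint8_t)(val & HINT_VAL_MASK);
}

/* True when the stored byte carries a placement hint. */
static bool is_placement_hint(uint8_t stored)
{
	return (stored & HINT_TYPE_BIT) != 0;
}
```

For example, placement hint 5 is stored as 0x85, while lifetime hint 2 is
stored as 0x02.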

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
 fs/fcntl.c                 | 67 ++++++++++++++++++++++++++++++++++++++
 include/linux/rw_hint.h    | 13 ++++++++
 include/uapi/linux/fcntl.h | 14 ++++++++
 3 files changed, 94 insertions(+)

Comments

Kanchan Joshi Sept. 11, 2024, 3:50 p.m. UTC | #1
On 9/11/2024 12:18 AM, Jens Axboe wrote:
> On 9/10/24 9:01 AM, Kanchan Joshi wrote:
>> +static inline bool rw_placement_hint_valid(u64 val)
>> +{
>> +	if (val <= MAX_PLACEMENT_HINT_VAL)
>> +		return true;
>> +
>> +	return false;
>> +}
> Nit, why not just:
> 
> static inline bool rw_placement_hint_valid(u64 val)
> {
> 	return val <= MAX_PLACEMENT_HINT_VAL;
> }
> 

Right, concise.
I can fold in both the changes in next respin.
Christoph Hellwig Sept. 12, 2024, 1:01 p.m. UTC | #2
On Tue, Sep 10, 2024 at 08:31:58PM +0530, Kanchan Joshi wrote:
> This is similar to existing F_{SET/GET}_RW_HINT but more
> generic/extensible.
> 
> F_SET/GET_RW_HINT_EX take a pointer to a struct rw_hint_ex as argument:
> 
> struct rw_hint_ex {
>         __u8    type;
>         __u8    pad[7];
>         __u64   val;
> };
> 
> With F_SET_RW_HINT_EX, the user passes the hint type and its value.
> Hint type can be either lifetime hint (TYPE_RW_LIFETIME_HINT) or
> placement hint (TYPE_RW_PLACEMENT_HINT). The interface allows more hint
> types to be added in the future.

What is the point of multiplexing these into a single call vs having
one fcntl for each?  It's not like the code points are a super
limited resource.

And the _EX name isn't exactly descriptive either and screams of horrible
Windows APIs :)

> +	WRITE_ONCE(inode->i_write_hint, hint);
> +	if (file->f_mapping->host != inode)
> +		WRITE_ONCE(file->f_mapping->host->i_write_hint, hint);

This doesn't work.  You need a file system method for this so that
the file system can intercept it, instead of storing it in completely
arbitrary inodes without any kind of checking for support or interception
point.

> --- a/include/linux/rw_hint.h
> +++ b/include/linux/rw_hint.h
> @@ -21,4 +21,17 @@ enum rw_lifetime_hint {
>  static_assert(sizeof(enum rw_lifetime_hint) == 1);
>  #endif
>  
> +#define WRITE_HINT_TYPE_BIT	BIT(7)
> +#define WRITE_HINT_VAL_MASK	(WRITE_HINT_TYPE_BIT - 1)
> +#define WRITE_HINT_TYPE(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
> +				TYPE_RW_PLACEMENT_HINT : TYPE_RW_LIFETIME_HINT)
> +#define WRITE_HINT_VAL(h)	((h) & WRITE_HINT_VAL_MASK)
> +
> +#define WRITE_PLACEMENT_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
> +				 WRITE_HINT_VAL(h) : 0)
> +#define WRITE_LIFETIME_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
> +				 0 : WRITE_HINT_VAL(h))
> +
> +#define PLACEMENT_HINT_TYPE	WRITE_HINT_TYPE_BIT
> +#define MAX_PLACEMENT_HINT_VAL	(WRITE_HINT_VAL_MASK - 1)

That's a whole lot of undocumented macros.  Please turn these into proper
inline functions and write documentation for them.
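
One possible shape for such documented inline functions is sketched below.
This is only an illustration of the request, not code from any posted
revision: the lower-case function names are hypothetical, and uint8_t stands
in for the kernel's u8 so the block compiles in userspace.

```c
#include <assert.h>
#include <stdint.h>

enum rw_hint_type {
	TYPE_RW_LIFETIME_HINT = 1,
	TYPE_RW_PLACEMENT_HINT
};

#define WRITE_HINT_TYPE_BIT	0x80u
#define WRITE_HINT_VAL_MASK	(WRITE_HINT_TYPE_BIT - 1)

static_assert(WRITE_HINT_VAL_MASK == 0x7f, "value occupies the low seven bits");

/**
 * write_hint_type - hint type encoded in a stored write hint
 * @h: the one-byte value kept in i_write_hint
 *
 * Returns TYPE_RW_PLACEMENT_HINT if bit 7 is set, else TYPE_RW_LIFETIME_HINT.
 */
static inline enum rw_hint_type write_hint_type(uint8_t h)
{
	return (h & WRITE_HINT_TYPE_BIT) ? TYPE_RW_PLACEMENT_HINT :
					   TYPE_RW_LIFETIME_HINT;
}

/* write_hint_val - hint value with the type bit stripped */
static inline uint8_t write_hint_val(uint8_t h)
{
	return (uint8_t)(h & WRITE_HINT_VAL_MASK);
}

/* write_placement_hint - placement value, or 0 if @h holds a lifetime hint */
static inline uint8_t write_placement_hint(uint8_t h)
{
	return write_hint_type(h) == TYPE_RW_PLACEMENT_HINT ? write_hint_val(h) : 0;
}

/* write_lifetime_hint - lifetime value, or 0 if @h holds a placement hint */
static inline uint8_t write_lifetime_hint(uint8_t h)
{
	return write_hint_type(h) == TYPE_RW_LIFETIME_HINT ? write_hint_val(h) : 0;
}
```

Compared with the macros, the functions give the argument a type and evaluate
it exactly once.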
Kanchan Joshi Sept. 12, 2024, 3:53 p.m. UTC | #3
On 9/12/2024 6:31 PM, Christoph Hellwig wrote:
> On Tue, Sep 10, 2024 at 08:31:58PM +0530, Kanchan Joshi wrote:
>> This is similar to existing F_{SET/GET}_RW_HINT but more
>> generic/extensible.
>>
>> F_SET/GET_RW_HINT_EX take a pointer to a struct rw_hint_ex as argument:
>>
>> struct rw_hint_ex {
>>          __u8    type;
>>          __u8    pad[7];
>>          __u64   val;
>> };
>>
>> With F_SET_RW_HINT_EX, the user passes the hint type and its value.
>> Hint type can be either lifetime hint (TYPE_RW_LIFETIME_HINT) or
>> placement hint (TYPE_RW_PLACEMENT_HINT). The interface allows more hint
>> types to be added in the future.
> 
> What is the point of multiplexing these into a single call vs having
> one fcntl for each?  It's not like the code points are a super
> limited resource.

Do you mean a new fcntl code only for the placement hint?
I thought folks would prefer the user interface to be future-proof so 
that they don't have to add a new fcntl opcode.
Had the existing fcntl accepted "hint type" as an argument, I would not 
have resorted to adding a new one now.

You may have noticed that in the io_uring metadata series also, even though 
the current meta type is 'integrity', we allow the user interface to express 
other types of metadata too.

> And the _EX name isn't exactly descriptive either and screams of horrible
> Windows APIs :)

I can change to what you prefer.
But my inspiration behind this name was Linux F_GETOWN_EX/F_SETOWN_EX (which 
is a revamped version of F_GETOWN/F_SETOWN).

>> +	WRITE_ONCE(inode->i_write_hint, hint);
>> +	if (file->f_mapping->host != inode)
>> +		WRITE_ONCE(file->f_mapping->host->i_write_hint, hint);
> 
> This doesn't work.  You need a file system method for this so that
> the file system can intercept it, instead of storing it in completely
> arbitrary inodes without any kind of checking for support or interception
> point.
> 

I don't understand why it will not work. The hint is being set in the 
same way as in the current code (in the existing fcntl handlers 
for temperature hints).

>> --- a/include/linux/rw_hint.h
>> +++ b/include/linux/rw_hint.h
>> @@ -21,4 +21,17 @@ enum rw_lifetime_hint {
>>   static_assert(sizeof(enum rw_lifetime_hint) == 1);
>>   #endif
>>   
>> +#define WRITE_HINT_TYPE_BIT	BIT(7)
>> +#define WRITE_HINT_VAL_MASK	(WRITE_HINT_TYPE_BIT - 1)
>> +#define WRITE_HINT_TYPE(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
>> +				TYPE_RW_PLACEMENT_HINT : TYPE_RW_LIFETIME_HINT)
>> +#define WRITE_HINT_VAL(h)	((h) & WRITE_HINT_VAL_MASK)
>> +
>> +#define WRITE_PLACEMENT_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
>> +				 WRITE_HINT_VAL(h) : 0)
>> +#define WRITE_LIFETIME_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
>> +				 0 : WRITE_HINT_VAL(h))
>> +
>> +#define PLACEMENT_HINT_TYPE	WRITE_HINT_TYPE_BIT
>> +#define MAX_PLACEMENT_HINT_VAL	(WRITE_HINT_VAL_MASK - 1)
> 
> That's a whole lot of undocumented macros.  Please turn these into proper
> inline functions and write documentation for them.

I can try doing that.
Bart Van Assche Sept. 12, 2024, 8:36 p.m. UTC | #4
On 9/10/24 8:01 AM, Kanchan Joshi wrote:
> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
> index b9942f5f13d3..ff708a75e2f6 100644
> --- a/include/linux/rw_hint.h
> +++ b/include/linux/rw_hint.h
> @@ -21,4 +21,17 @@ enum rw_lifetime_hint {
>   static_assert(sizeof(enum rw_lifetime_hint) == 1);
>   #endif
>   
> +#define WRITE_HINT_TYPE_BIT	BIT(7)
> +#define WRITE_HINT_VAL_MASK	(WRITE_HINT_TYPE_BIT - 1)
> +#define WRITE_HINT_TYPE(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
> +				TYPE_RW_PLACEMENT_HINT : TYPE_RW_LIFETIME_HINT)
> +#define WRITE_HINT_VAL(h)	((h) & WRITE_HINT_VAL_MASK)
> +
> +#define WRITE_PLACEMENT_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
> +				 WRITE_HINT_VAL(h) : 0)
> +#define WRITE_LIFETIME_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
> +				 0 : WRITE_HINT_VAL(h))
> +
> +#define PLACEMENT_HINT_TYPE	WRITE_HINT_TYPE_BIT
> +#define MAX_PLACEMENT_HINT_VAL	(WRITE_HINT_VAL_MASK - 1)
>   #endif /* _LINUX_RW_HINT_H */

The above macros implement a union of two 7-bit types in an 8-bit field.
Wouldn't we be better off by using two separate 8-bit values such that we
don't need the above macros?

Thanks,

Bart.
Kanchan Joshi Sept. 13, 2024, 7:15 a.m. UTC | #5
On 9/13/2024 2:06 AM, Bart Van Assche wrote:
> On 9/10/24 8:01 AM, Kanchan Joshi wrote:
>> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
>> index b9942f5f13d3..ff708a75e2f6 100644
>> --- a/include/linux/rw_hint.h
>> +++ b/include/linux/rw_hint.h
>> @@ -21,4 +21,17 @@ enum rw_lifetime_hint {
>>   static_assert(sizeof(enum rw_lifetime_hint) == 1);
>>   #endif
>> +#define WRITE_HINT_TYPE_BIT    BIT(7)
>> +#define WRITE_HINT_VAL_MASK    (WRITE_HINT_TYPE_BIT - 1)
>> +#define WRITE_HINT_TYPE(h)    (((h) & WRITE_HINT_TYPE_BIT) ? \
>> +                TYPE_RW_PLACEMENT_HINT : TYPE_RW_LIFETIME_HINT)
>> +#define WRITE_HINT_VAL(h)    ((h) & WRITE_HINT_VAL_MASK)
>> +
>> +#define WRITE_PLACEMENT_HINT(h)    (((h) & WRITE_HINT_TYPE_BIT) ? \
>> +                 WRITE_HINT_VAL(h) : 0)
>> +#define WRITE_LIFETIME_HINT(h)    (((h) & WRITE_HINT_TYPE_BIT) ? \
>> +                 0 : WRITE_HINT_VAL(h))
>> +
>> +#define PLACEMENT_HINT_TYPE    WRITE_HINT_TYPE_BIT
>> +#define MAX_PLACEMENT_HINT_VAL    (WRITE_HINT_VAL_MASK - 1)
>>   #endif /* _LINUX_RW_HINT_H */
> 
> The above macros implement a union of two 7-bit types in an 8-bit field.
> Wouldn't we be better off by using two separate 8-bit values such that we
> don't need the above macros?

I had considered that, but it requires two bytes of space in the inode, 
bio, and request.
For example this change in inode:

@@ -674,7 +674,13 @@ struct inode {
         spinlock_t              i_lock; /* i_blocks, i_bytes, maybe i_size */
         unsigned short          i_bytes;
         u8                      i_blkbits;
-       u8                      i_write_hint;
+       union {
+               struct {
+                       enum rw_lifetime_hint lifetime_hint;
+                       u8 placement_hint;
+               };
+               u16 i_write_hint;
+       };

With this, generic propagation code will continue to use 
inode->i_write_hint, and specific places (that care) can use either 
lifetime_hint or placement_hint.

That removes the need for the type bit and the macros above, but we don't 
have the two bytes of space currently.
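
To make the trade-off concrete, the two-byte layout can be modelled in
userspace as below. This is a sketch under stated assumptions: struct
hint_pair is a hypothetical stand-in for the relevant corner of struct inode,
not the real layout, and uint8_t replaces both the enum and u8.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the relevant part of struct inode; not the real layout. */
struct hint_pair {
	union {
		struct {
			uint8_t lifetime_hint;	/* enum rw_lifetime_hint in the kernel */
			uint8_t placement_hint;
		};
		uint16_t i_write_hint;	/* what generic propagation code copies */
	};
};

static_assert(sizeof(struct hint_pair) == 2, "two bytes instead of one");

/* Copy both hints at once, the way generic code would pass them down. */
static void propagate_hint(struct hint_pair *dst, const struct hint_pair *src)
{
	dst->i_write_hint = src->i_write_hint;
}
```

With this layout both hints can be set at the same time and neither needs a
type bit, at the cost of one extra byte in each structure that carries the
hint.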

Patch

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9df35e7ff754..b35aec56981a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -291,6 +291,14 @@  static bool rw_lifetime_hint_valid(u64 hint)
 	}
 }
 
+static inline bool rw_placement_hint_valid(u64 val)
+{
+	if (val <= MAX_PLACEMENT_HINT_VAL)
+		return true;
+
+	return false;
+}
+
 static long fcntl_get_rw_lifetime_hint(struct file *file, unsigned int cmd,
 			      unsigned long arg)
 {
@@ -327,6 +335,59 @@  static long fcntl_set_rw_lifetime_hint(struct file *file, unsigned int cmd,
 	return 0;
 }
 
+static long fcntl_get_rw_hint_ex(struct file *file, unsigned int cmd,
+			      unsigned long arg)
+{
+	struct rw_hint_ex __user *rw_hint_ex_p = (void __user *)arg;
+	struct rw_hint_ex rwh = {};
+	struct inode *inode = file_inode(file);
+	u8 hint = READ_ONCE(inode->i_write_hint);
+
+	rwh.type = WRITE_HINT_TYPE(hint);
+	rwh.val = WRITE_HINT_VAL(hint);
+
+	if (copy_to_user(rw_hint_ex_p, &rwh, sizeof(rwh)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static long fcntl_set_rw_hint_ex(struct file *file, unsigned int cmd,
+			      unsigned long arg)
+{
+	struct rw_hint_ex __user *rw_hint_ex_p = (void __user *)arg;
+	struct rw_hint_ex rwh;
+	struct inode *inode = file_inode(file);
+	u64 hint;
+	int i;
+
+	if (copy_from_user(&rwh, rw_hint_ex_p, sizeof(rwh)))
+		return -EFAULT;
+	for (i = 0; i < ARRAY_SIZE(rwh.pad); i++)
+		if (rwh.pad[i])
+			return -EINVAL;
+	switch (rwh.type) {
+	case TYPE_RW_LIFETIME_HINT:
+		if (!rw_lifetime_hint_valid(rwh.val))
+			return -EINVAL;
+		hint = rwh.val;
+		break;
+	case TYPE_RW_PLACEMENT_HINT:
+		if (!rw_placement_hint_valid(rwh.val))
+			return -EINVAL;
+		hint = PLACEMENT_HINT_TYPE | rwh.val;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	WRITE_ONCE(inode->i_write_hint, hint);
+	if (file->f_mapping->host != inode)
+		WRITE_ONCE(file->f_mapping->host->i_write_hint, hint);
+
+	return 0;
+}
+
 /* Is the file descriptor a dup of the file? */
 static long f_dupfd_query(int fd, struct file *filp)
 {
@@ -454,6 +515,12 @@  static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_RW_HINT:
 		err = fcntl_set_rw_lifetime_hint(filp, cmd, arg);
 		break;
+	case F_GET_RW_HINT_EX:
+		err = fcntl_get_rw_hint_ex(filp, cmd, arg);
+		break;
+	case F_SET_RW_HINT_EX:
+		err = fcntl_set_rw_hint_ex(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
index b9942f5f13d3..ff708a75e2f6 100644
--- a/include/linux/rw_hint.h
+++ b/include/linux/rw_hint.h
@@ -21,4 +21,17 @@  enum rw_lifetime_hint {
 static_assert(sizeof(enum rw_lifetime_hint) == 1);
 #endif
 
+#define WRITE_HINT_TYPE_BIT	BIT(7)
+#define WRITE_HINT_VAL_MASK	(WRITE_HINT_TYPE_BIT - 1)
+#define WRITE_HINT_TYPE(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
+				TYPE_RW_PLACEMENT_HINT : TYPE_RW_LIFETIME_HINT)
+#define WRITE_HINT_VAL(h)	((h) & WRITE_HINT_VAL_MASK)
+
+#define WRITE_PLACEMENT_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
+				 WRITE_HINT_VAL(h) : 0)
+#define WRITE_LIFETIME_HINT(h)	(((h) & WRITE_HINT_TYPE_BIT) ? \
+				 0 : WRITE_HINT_VAL(h))
+
+#define PLACEMENT_HINT_TYPE	WRITE_HINT_TYPE_BIT
+#define MAX_PLACEMENT_HINT_VAL	(WRITE_HINT_VAL_MASK - 1)
 #endif /* _LINUX_RW_HINT_H */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index c0bcc185fa48..f758a7230419 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -57,6 +57,8 @@ 
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
+#define F_GET_RW_HINT_EX	(F_LINUX_SPECIFIC_BASE + 15)
+#define F_SET_RW_HINT_EX	(F_LINUX_SPECIFIC_BASE + 16)
 
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
@@ -76,6 +78,18 @@ 
  */
 #define RWF_WRITE_LIFE_NOT_SET	RWH_WRITE_LIFE_NOT_SET
 
+enum rw_hint_type {
+	TYPE_RW_LIFETIME_HINT = 1,
+	TYPE_RW_PLACEMENT_HINT
+};
+
+/* Exchange information with F_{GET/SET}_RW_HINT_EX fcntl */
+struct rw_hint_ex {
+	__u8	type;
+	__u8	pad[7];
+	__u64	val;
+};
+
 /*
  * Types of directory notifications that may be requested.
  */