[v7,00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

Message ID 20201019021726.12048-1-dmitry.fomichev@wdc.com

Message

Dmitry Fomichev Oct. 19, 2020, 2:17 a.m. UTC
v6 -> v7:

 - Introduce ns->iocs initialization function earlier in the series,
   in CSE Log patch.

 - Set NVM iocs for zoned namespaces when CC.CSS is set to
   NVME_CC_CSS_NVM.

 - Clean up code in CSE log handler.
 
v5 -> v6:

 - Remove zoned state persistence code. Replace position-independent
   zone lists with QTAILQs.

 - Close all open zones upon clearing of the controller. This is
   a similar procedure to the one previously performed upon powering
   up with zone persistence. 

 - Squash NS Types and ZNS triplets of commits to keep definitions
   and trace event definitions together with the implementation code.

 - Move namespace UUID generation to a separate patch. Add the new
   "uuid" property as suggested by Klaus.

 - Rework Commands and Effects patch to make sure that the log is
   always in sync with the actual set of commands supported.

 - Add two refactoring commits at the end of the series to
   optimize read and write i/o path.

 - Incorporate feedback from Keith, Klaus and Niklas:

  * fix rebase errors in nvme_identify_ns_descr_list()
  * remove unnecessary code from nvme_write_bar()
  * move csi to NvmeNamespace and use it from the beginning in NSTypes
    patch
  * change zone read processing to cover all corner cases with RAZB=1
  * sync w_ptr and d.wp in case of an i/o error at the preceding zone
  * reword the commit message in active/inactive patch with the new
    text from Niklas
  * correct dlfeat reporting depending on the fill pattern set
  * add more checks for "attached" n/s parameter to prevent i/o and
    get/set features on inactive namespaces
  * Use DEFINE_PROP_SIZE and DEFINE_PROP_SIZE32 for zone size/capacity
    and ZASL respectively
  * Improve zone size and capacity validation
  * Correctly report NSZE

v4 -> v5:

 - Rebase to the current qemu-nvme.

 - Use HostMemoryBackendFile as the backing storage for persistent
   zone metadata.

 - Fix the issue with filling the valid data in the next zone if RAZB
   is enabled.

v3 -> v4:

 - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
   to a zone happen at the new write pointer variable, zone->w_ptr,
   that is advanced right after submitting the backend i/o. The existing
   zone->d.wp variable is updated upon the successful write completion
   and it is used for zone reporting. Some code has been split from
   nvme_finalize_zoned_write() function to a new function,
   nvme_advance_zone_wp().

 - Make the code compile under mingw. Switch to using QEMU API for
   mmap/msync, i.e. memory_region...(). Since mmap is not available in
   mingw (even though there is mman-win32 library available on Github),
   conditional compilation is added around these calls to avoid
   undefined symbols under mingw. A better fix would be to add stub
   functions to softmmu/memory.c for the case when CONFIG_POSIX is not
   defined, but such change is beyond the scope of this patchset and it
   can be made in a separate patch.

 - Correct permission mask used to open zone metadata file.

 - Fold "Define 64 bit cqe.result" patch into ZNS commit.

 - Use clz64/clz32 instead of defining nvme_ilog2() function.

 - Simplify rpt_empty_id_struct() code, move nvme_fill_data() back
   to ZNS patch.

 - Fix a power-on processing bug.

 - Rename NVME_CMD_ZONE_APND to NVME_CMD_ZONE_APPEND.

 - Make the list of review comments addressed in v2 of the series
   (see below).
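
For readers following the QD > 1 item above, the split between the two
pointers can be sketched as below. This is a simplified illustration
only, not the code from the series: the SketchZone type and both
function names are hypothetical, and real handling of failed writes is
left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified zone; only the two pointers matter here. */
typedef struct SketchZone {
    uint64_t w_ptr; /* next LBA to hand out; advanced at submission */
    uint64_t wp;    /* reported write pointer; advanced at completion */
} SketchZone;

/* Reserve nlb blocks for a write and return the LBA it must target. */
static uint64_t sketch_zone_submit(SketchZone *z, uint32_t nlb)
{
    uint64_t slba = z->w_ptr;

    z->w_ptr += nlb;
    return slba;
}

/* Called when the backend i/o finishes; only a success moves wp. */
static void sketch_zone_complete(SketchZone *z, uint32_t nlb, bool ok)
{
    if (ok) {
        z->wp += nlb;
    }
}
```

Two writes submitted back to back receive distinct start LBAs because
w_ptr moves at submission time, while zone reporting keeps reading wp,
which only moves once data has actually landed.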

v2 -> v3:

 - Moved nvme_fill_data() function to the NSTypes patch as it is
   now used there to output empty namespace identify structs.
 - Fixed typo in Maxim's email address.

v1 -> v2:

 - Rebased on top of qemu-nvme/next branch.
 - Incorporated feedback from Klaus and Alistair.
    * Allow a subset of CSE log to be read, not the entire log
    * Assign admin command entries in CSE log to ACS fields
    * Set LPA bit 1 to indicate support of CSE log page
    * Rename CC.CSS value CSS_ALL_NSTYPES (110b) to CSS_CSI
    * Move the code to assign lbaf.ds to a separate patch
    * Remove the change in firmware revision
    * Change "driver" to "device" in comments and annotations
    * Rename ZAMDS to ZASL
    * Correct a few format expressions and some wording in
      trace event definitions
    * Remove validation code to return NVME_CAP_EXCEEDED error
    * Make ZASL equal to MDTS if "zone_append_size_limit"
      module parameter is not set
    * Clean up nvme_zoned_init_ctrl() to make size calculations
      less confusing
    * Avoid changing module parameters, use separate n/s variables
      if additional calculations are necessary to convert parameters
      to running values
    * Use NVME_DEFAULT_ZONE_SIZE to assign the default zone size value
    * Use default 0 for zone capacity meaning that zone capacity will
      be equal to zone size by default
    * Issue warnings if user MAR/MOR values are too large and have
      to be adjusted
    * Use unsigned values for MAR/MOR
 - Dropped "Simulate Zone Active excursions" patch.
   Excursion behavior may depend on the internal controller
   architecture and therefore be vendor-specific.
 - Dropped support for Zone Attributes and zoned AENs for now.
   These features can be added in a future series.
 - NS Types support is extended to handle active/inactive namespaces.
 - Update the write pointer after backing storage I/O completion, not
   before. This makes the emulation run correctly in case of
   backing device failures.
 - Avoid division in the I/O path if the device zone size is
   a power of two (the most common case). Zone index then can be
   calculated by using bit shift.
 - A few reported bugs have been fixed.
 - Indentation in function definitions has been changed to make it
   the same as the rest of the code.
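
The "avoid division in the I/O path" item above can be illustrated with
a small sketch. It assumes a non-zero zone size expressed in logical
blocks; the SketchZoneGeometry type and the helper names are
hypothetical and are not taken from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry holder; field names are illustrative only. */
typedef struct SketchZoneGeometry {
    uint64_t zone_size;      /* zone size in logical blocks, non-zero */
    uint32_t zone_size_log2; /* log2(zone_size), or 0 if shift unusable */
} SketchZoneGeometry;

static bool sketch_is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

static uint32_t sketch_ilog2(uint64_t v)
{
    uint32_t n = 0;

    while (v >>= 1) {
        n++;
    }
    return n;
}

/* Done once at namespace setup, so the I/O path never recomputes it. */
static void sketch_geometry_init(SketchZoneGeometry *g, uint64_t zone_size)
{
    g->zone_size = zone_size;
    g->zone_size_log2 = sketch_is_pow2(zone_size) ? sketch_ilog2(zone_size) : 0;
}

/* Map a starting LBA to its zone index. */
static uint64_t sketch_zone_index(const SketchZoneGeometry *g, uint64_t slba)
{
    if (g->zone_size_log2) {
        return slba >> g->zone_size_log2; /* power-of-two fast path */
    }
    return slba / g->zone_size;           /* general case */
}
```

A zone size of one block has log2 equal to 0 and therefore takes the
division path, which is still correct.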


Zoned Namespace (ZNS) Command Set is a newly introduced command set
published by the NVM Express, Inc. organization as TP 4053. The main
design goals of ZNS are to give hardware designers the means to
reduce NVMe controller complexity and to achieve better I/O
latency and throughput. SSDs that implement this interface are
commonly known as ZNS SSDs.

This command set implements a zoned storage model, similar to
ZAC/ZBC. As such, there is already support in Linux, allowing one to
perform the majority of tasks needed for managing ZNS SSDs.

The Zoned Namespace Command Set relies on another TP, known as
Namespace Types (NVMe TP 4056), which introduces support for having
multiple command sets per namespace.

Both ZNS and Namespace Types specifications can be downloaded by
visiting the following link -

https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs.zip

This patch series adds Namespace Types support and zoned namespace
emulation capability to the existing NVMe PCI device.

Based-on: <20201013174826.GA1049145@dhcp-10-100-145-180.wdl.wdc.com>

Dmitry Fomichev (9):
  hw/block/nvme: Add Commands Supported and Effects log
  hw/block/nvme: Generate namespace UUIDs
  hw/block/nvme: Support Zoned Namespace Command Set
  hw/block/nvme: Introduce max active and open zone limits
  hw/block/nvme: Support Zone Descriptor Extensions
  hw/block/nvme: Add injection of Offline/Read-Only zones
  hw/block/nvme: Document zoned parameters in usage text
  hw/block/nvme: Separate read and write handlers
  hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()

Niklas Cassel (2):
  hw/block/nvme: Add support for Namespace Types
  hw/block/nvme: Support allocated CNS command variants

 block/nvme.c          |    2 +-
 hw/block/nvme-ns.c    |  295 ++++++++
 hw/block/nvme-ns.h    |  109 +++
 hw/block/nvme.c       | 1550 ++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h       |    9 +
 hw/block/trace-events |   36 +-
 include/block/nvme.h  |  201 +++++-
 7 files changed, 2078 insertions(+), 124 deletions(-)

Comments

Niklas Cassel Oct. 19, 2020, 7:32 a.m. UTC | #1
On Mon, Oct 19, 2020 at 11:17:15AM +0900, Dmitry Fomichev wrote:

(snip)

> 
> Dmitry Fomichev (9):
>   hw/block/nvme: Add Commands Supported and Effects log
>   hw/block/nvme: Generate namespace UUIDs
>   hw/block/nvme: Support Zoned Namespace Command Set
>   hw/block/nvme: Introduce max active and open zone limits
>   hw/block/nvme: Support Zone Descriptor Extensions
>   hw/block/nvme: Add injection of Offline/Read-Only zones
>   hw/block/nvme: Document zoned parameters in usage text
>   hw/block/nvme: Separate read and write handlers
>   hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()
> 
> Niklas Cassel (2):
>   hw/block/nvme: Add support for Namespace Types
>   hw/block/nvme: Support allocated CNS command variants
> 
>  block/nvme.c          |    2 +-
>  hw/block/nvme-ns.c    |  295 ++++++++
>  hw/block/nvme-ns.h    |  109 +++
>  hw/block/nvme.c       | 1550 ++++++++++++++++++++++++++++++++++++++---
>  hw/block/nvme.h       |    9 +
>  hw/block/trace-events |   36 +-
>  include/block/nvme.h  |  201 +++++-
>  7 files changed, 2078 insertions(+), 124 deletions(-)
> 
> -- 
> 2.21.0
> 

Thank you Dmitry, this version was easier to review.

One remaining nit: a /* fall through */ comment is missing in
nvme_cmd_effects() (in the "hw/block/nvme: Add Commands Supported and
Effects log" patch).
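
For illustration, an annotated switch of that shape could look like the
sketch below. The enum values and the function name are stand-ins
chosen for this example, not the definitions from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the CC.CSS settings used in the patch. */
enum {
    SKETCH_CSS_NVM        = 0,
    SKETCH_CSS_ADMIN_ONLY = 7,
};

/* Pick the I/O command table to report; NULL means "report none". */
static const uint32_t *sketch_select_iocs(unsigned css,
                                          const uint32_t *nvm_table)
{
    const uint32_t *src = NULL;

    switch (css) {
    case SKETCH_CSS_NVM:
        src = nvm_table;
        /* fall through */
    case SKETCH_CSS_ADMIN_ONLY:
        break;
    }

    return src;
}
```

The comment documents that the missing break is intentional, and
compilers that warn about implicit fallthrough generally accept it as
an annotation.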

For the whole series:
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Klaus Jensen Oct. 19, 2020, 12:33 p.m. UTC | #2
On Oct 19 11:17, Dmitry Fomichev wrote:
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index d6b2808b97..170cbb8cdc 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -34,6 +45,18 @@ typedef struct NvmeNamespace {
>      const uint32_t *iocs;
>      uint8_t      csi;
>  
> +    NvmeIdNsZoned   *id_ns_zoned;
> +    NvmeZone        *zone_array;
> +    QTAILQ_HEAD(, NvmeZone) exp_open_zones;
> +    QTAILQ_HEAD(, NvmeZone) imp_open_zones;
> +    QTAILQ_HEAD(, NvmeZone) closed_zones;
> +    QTAILQ_HEAD(, NvmeZone) full_zones;

Apart from the imp_open_zones list that is being used in a later patch
to support Implicitly Opened to Closed transitions, these lists seem
rather pointless. As far as I can tell the only use they have is being
inserted into, removed from and checking if a zone is in one of those
four states?

The Zone Management Receive (and Send with Select All) just iterates
over all zones and matches on state.
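
A rough sketch of such a scan, assuming each zone records its own
state. The types and names below are hypothetical and much simpler than
the NvmeZone structure in the patch.

```c
#include <stddef.h>

/* Hypothetical zone states; the real set in the series is larger. */
typedef enum SketchZoneState {
    SKETCH_ZS_EMPTY,
    SKETCH_ZS_IMP_OPEN,
    SKETCH_ZS_EXP_OPEN,
    SKETCH_ZS_CLOSED,
    SKETCH_ZS_FULL,
} SketchZoneState;

typedef struct SketchStateZone {
    SketchZoneState state;
} SketchStateZone;

/* Count zones in a given state by walking the flat zone array. */
static size_t sketch_count_zones(const SketchStateZone *zones,
                                 size_t nr_zones, SketchZoneState state)
{
    size_t i, count = 0;

    for (i = 0; i < nr_zones; i++) {
        if (zones[i].state == state) {
            count++;
        }
    }
    return count;
}
```

With the state stored per zone, membership in a list carries no extra
information for this kind of query.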
Keith Busch Oct. 19, 2020, 8:07 p.m. UTC | #3
On Mon, Oct 19, 2020 at 11:17:19AM +0900, Dmitry Fomichev wrote:
> Add a new Boolean namespace property, "attached", to provide the most
> basic namespace attachment support. The default value for this new
> property is true. Also, implement the logic in the new CNS values to
> include/exclude namespaces based on this new property. The only thing
> missing is hooking up the actual Namespace Attachment command opcode,
> which will allow a user to toggle the "attached" flag per namespace.
> 
> The reason for not hooking up this command completely is because the
> NVMe specification requires the namespace management command to be
> supported if the namespace attachment command is supported.

Huh, the spec does require that, and that seems like an odd requirement
since it prevents dynamic namespace attach states in a static namespace
setup. I'm not sure why the spec assumes those two things go together,
but it sure enough does!

The implementation looks fine.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Klaus Jensen Oct. 19, 2020, 8:16 p.m. UTC | #4
On Oct 19 11:17, Dmitry Fomichev wrote:
> This log page becomes necessary to implement to allow checking for
> Zone Append command support in Zoned Namespace Command Set.
> 
> This commit adds the code to report this log page for NVM Command
> Set only. The parts that are specific to zoned operation will be
> added later in the series.
> 
> All incoming admin and i/o commands are now only processed if their
> corresponding support bits are set in this log. This provides an
> easy way to control what commands to support and what not to
> depending on set CC.CSS.
> 
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme-ns.h    |  1 +
>  hw/block/nvme.c       | 98 +++++++++++++++++++++++++++++++++++++++----
>  hw/block/trace-events |  2 +
>  include/block/nvme.h  | 19 +++++++++
>  4 files changed, 111 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 83734f4606..ea8c2f785d 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -29,6 +29,7 @@ typedef struct NvmeNamespace {
>      int32_t      bootindex;
>      int64_t      size;
>      NvmeIdNs     id_ns;
> +    const uint32_t *iocs;
>  
>      NvmeNamespaceParams params;
>  } NvmeNamespace;
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 9d30ca69dc..5a9493d89f 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -111,6 +111,28 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
>      [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,
>  };
>  
> +static const uint32_t nvme_cse_acs[256] = {
> +    [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_DELETE_CQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_CREATE_CQ]        = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_IDENTIFY]         = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,
> +    [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
> +};

NVME_ADM_CMD_ABORT is missing. And since you added a (redundant) check
in nvme_admin_cmd that checks this table, Abort is now an invalid
command.

Also, can you reorder it according to opcode instead of
pseudo-lexicographically?
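
To make the effect concrete, here is a standalone sketch of a
table-driven check with the entries in opcode order and Abort included.
The SKETCH_ names are hypothetical; the opcode values are the admin
opcodes from the NVMe base specification, and bit 0 stands for the
"command supported" flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EFF_CSUPP (1u << 0) /* command supported */

/* Admin opcodes, listed in opcode order. */
enum {
    SKETCH_ADM_DELETE_SQ    = 0x00,
    SKETCH_ADM_CREATE_SQ    = 0x01,
    SKETCH_ADM_GET_LOG_PAGE = 0x02,
    SKETCH_ADM_DELETE_CQ    = 0x04,
    SKETCH_ADM_CREATE_CQ    = 0x05,
    SKETCH_ADM_IDENTIFY     = 0x06,
    SKETCH_ADM_ABORT        = 0x08,
    SKETCH_ADM_SET_FEATURES = 0x09,
    SKETCH_ADM_GET_FEATURES = 0x0a,
    SKETCH_ADM_ASYNC_EV_REQ = 0x0c,
};

/* Any opcode without an entry stays 0 and is treated as unsupported. */
static const uint32_t sketch_cse_acs[256] = {
    [SKETCH_ADM_DELETE_SQ]    = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_CREATE_SQ]    = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_GET_LOG_PAGE] = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_DELETE_CQ]    = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_CREATE_CQ]    = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_IDENTIFY]     = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_ABORT]        = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_SET_FEATURES] = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_GET_FEATURES] = SKETCH_EFF_CSUPP,
    [SKETCH_ADM_ASYNC_EV_REQ] = SKETCH_EFF_CSUPP,
};

static bool sketch_admin_opcode_supported(uint8_t opcode)
{
    return (sketch_cse_acs[opcode] & SKETCH_EFF_CSUPP) != 0;
}
```

Because the lookup gates dispatch, leaving an opcode out of the table
rejects the command before the switch statement is ever reached.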

> +
> +static const uint32_t nvme_cse_iocs_none[256] = {
> +};

[-pedantic] no need for the '= {}'

> +
> +static const uint32_t nvme_cse_iocs_nvm[256] = {
> +    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
> +    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
> +};
> +
>  static void nvme_process_sq(void *opaque);
>  
>  static uint16_t nvme_cid(NvmeRequest *req)
> @@ -1032,10 +1054,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>      trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
>                            req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));
>  
> -    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
> -        return NVME_INVALID_OPCODE | NVME_DNR;
> -    }
> -

I would assume the device to respond with invalid opcode before
validating the nsid if it is an admin only device.

>      if (!nvme_nsid_valid(n, nsid)) {
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
> @@ -1045,6 +1063,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> +    if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
> +        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +
>      switch (req->cmd.opcode) {
>      case NVME_CMD_FLUSH:
>          return nvme_flush(n, req);
> @@ -1054,8 +1077,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>      case NVME_CMD_READ:
>          return nvme_rw(n, req);
>      default:
> -        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
> -        return NVME_INVALID_OPCODE | NVME_DNR;
> +        assert(false);
>      }
>  }
>  
> @@ -1291,6 +1313,39 @@ static uint16_t nvme_error_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
>                      DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,
> +                                 uint64_t off, NvmeRequest *req)
> +{
> +    NvmeEffectsLog log = {};

[-pedantic] an empty initializer list is not allowed, should be '{0}'.
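
A small sketch of both pedantic points, assuming ISO C prior to C23.
The Sketch names are hypothetical; the struct only mirrors the layout
of the log in the patch.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct SketchEffectsLog {
    uint32_t acs[256];
    uint32_t iocs[256];
    uint8_t  resv[2048];
} SketchEffectsLog;

/* Static storage duration: zero-initialized without any initializer. */
static const uint32_t sketch_iocs_none[256];

/* Returns 1 if a '{0}'-initialized automatic log is all zero bytes. */
static int sketch_log_starts_zeroed(void)
{
    SketchEffectsLog log = {0}; /* portable spelling of "all members 0" */
    const uint8_t *p = (const uint8_t *)&log;
    size_t i;

    for (i = 0; i < sizeof(log); i++) {
        if (p[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

The '{}' spelling is accepted by GCC and Clang as an extension, which
is why it only shows up with -pedantic.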

> +    const uint32_t *src_iocs = NULL;
> +    uint32_t trans_len;
> +
> +    trace_pci_nvme_cmd_supp_and_effects_log_read();

This has just been traced in nvme_admin_cmd and this doesn't add any
additional info.

> +
> +    if (off >= sizeof(log)) {
> +        trace_pci_nvme_err_invalid_effects_log_offset(off);

Can we do `trace_pci_nvme_err_invalid_log_page_offset(off)` instead? Then
we can easily reuse it in the other log pages.

> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    switch (NVME_CC_CSS(n->bar.cc)) {
> +    case NVME_CC_CSS_NVM:
> +        src_iocs = nvme_cse_iocs_nvm;
> +    case NVME_CC_CSS_ADMIN_ONLY:
> +        break;
> +    }
> +
> +    memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
> +
> +    if (src_iocs) {
> +        memcpy(log.iocs, src_iocs, sizeof(log.iocs));
> +    }
> +
> +    trans_len = MIN(sizeof(log) - off, buf_len);
> +
> +    return nvme_dma(n, ((uint8_t *)&log) + off, trans_len,
> +                    DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
>  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeCmd *cmd = &req->cmd;
> @@ -1334,6 +1389,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>          return nvme_smart_info(n, rae, len, off, req);
>      case NVME_LOG_FW_SLOT_INFO:
>          return nvme_fw_log_info(n, len, off, req);
> +    case NVME_LOG_CMD_EFFECTS:
> +        return nvme_cmd_effects(n, len, off, req);
>      default:
>          trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1920,6 +1977,11 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>      trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
>                               nvme_adm_opc_str(req->cmd.opcode));
>  
> +    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
> +        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
> +        return NVME_INVALID_OPCODE | NVME_DNR;
> +    }
> +

This is the (redundant) check that effectively makes Abort an invalid
command.

>      switch (req->cmd.opcode) {
>      case NVME_ADM_CMD_DELETE_SQ:
>          return nvme_del_sq(n, req);
> @@ -1942,8 +2004,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>      case NVME_ADM_CMD_ASYNC_EV_REQ:
>          return nvme_aer(n, req);
>      default:
> -        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
> -        return NVME_INVALID_OPCODE | NVME_DNR;
> +        assert(false);
>      }
>  }
>  
> @@ -2031,6 +2092,23 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>      n->bar.cc = 0;
>  }
>  
> +static void nvme_select_ns_iocs(NvmeCtrl *n)
> +{
> +    NvmeNamespace *ns;
> +    int i;
> +
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +        ns->iocs = nvme_cse_iocs_none;
> +        if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
> +            ns->iocs = nvme_cse_iocs_nvm;
> +        }
> +    }
> +}
> +
>  static int nvme_start_ctrl(NvmeCtrl *n)
>  {
>      uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
> @@ -2129,6 +2207,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  
>      QTAILQ_INIT(&n->aer_queue);
>  
> +    nvme_select_ns_iocs(n);
> +
>      return 0;
>  }
>  
> @@ -2737,7 +2817,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>      id->acl = 3;
>      id->aerl = n->params.aerl;
>      id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;
> -    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_EXTENDED;
> +    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_CSE | NVME_LPA_EXTENDED;
>  
>      /* recommended default value (~70 C) */
>      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index fac5995d94..0ae9cb0d35 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -85,6 +85,7 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
>  pci_nvme_mmio_stopped(void) "cleared controller enable bit"
>  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
>  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
> +pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported and effects log read"
>  
>  # nvme traces for error conditions
>  pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"
> @@ -104,6 +105,7 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"
>  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> +pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands supported and effects log offset must be 0, got %"PRIu64""
>  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
>  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 6de2d5aa75..4779495b7d 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -744,10 +744,27 @@ enum NvmeSmartWarn {
>      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
>  };
>  
> +typedef struct NvmeEffectsLog {
> +    uint32_t    acs[256];
> +    uint32_t    iocs[256];
> +    uint8_t     resv[2048];
> +} NvmeEffectsLog;
> +
> +enum {
> +    NVME_CMD_EFF_CSUPP      = 1 << 0,
> +    NVME_CMD_EFF_LBCC       = 1 << 1,
> +    NVME_CMD_EFF_NCC        = 1 << 2,
> +    NVME_CMD_EFF_NIC        = 1 << 3,
> +    NVME_CMD_EFF_CCC        = 1 << 4,
> +    NVME_CMD_EFF_CSE_MASK   = 3 << 16,
> +    NVME_CMD_EFF_UUID_SEL   = 1 << 19,
> +};
> +
>  enum NvmeLogIdentifier {
>      NVME_LOG_ERROR_INFO     = 0x01,
>      NVME_LOG_SMART_INFO     = 0x02,
>      NVME_LOG_FW_SLOT_INFO   = 0x03,
> +    NVME_LOG_CMD_EFFECTS    = 0x05,
>  };
>  
>  typedef struct QEMU_PACKED NvmePSD {
> @@ -860,6 +877,7 @@ enum NvmeIdCtrlFrmw {
>  
>  enum NvmeIdCtrlLpa {
>      NVME_LPA_NS_SMART = 1 << 0,
> +    NVME_LPA_CSE      = 1 << 1,
>      NVME_LPA_EXTENDED = 1 << 2,
>  };
>  
> @@ -1059,6 +1077,7 @@ static inline void _nvme_check_size(void)
>      QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);
> +    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);
>      QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);
> -- 
> 2.21.0
> 
>
Klaus Jensen Oct. 20, 2020, 8:21 a.m. UTC | #5
On Oct 19 11:17, Dmitry Fomichev wrote:

(snip)

> CAP.CSS (together with the I/O Command Set data structure) defines
> what command sets are supported by the controller.
> 
> CC.CSS (together with Set Profile) can be set to enable a subset of
> the available command sets.
> 
> Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
> will still be attached (and thus marked as active).
> Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
> will still be attached (and thus marked as active).
> 
> However, any operation from a disabled command set will result in a
> Invalid Command Opcode.
> 

This part of the commit message seems irrelevant to the patch.

> Add a new Boolean namespace property, "attached", to provide the most
> basic namespace attachment support. The default value for this new
> property is true. Also, implement the logic in the new CNS values to
> include/exclude namespaces based on this new property. The only thing
> missing is hooking up the actual Namespace Attachment command opcode,
> which will allow a user to toggle the "attached" flag per namespace.
> 

Without Namespace Attachment support, the sole purpose of this parameter
is to allow unusable namespace IDs to be reported. I have no problems
with adding support for the additional CNS values. They will return
identical responses, but I think that is good enough for now.

When it is not really needed, we should be wary of adding a parameter
that is really hard to get rid of again.

> The reason for not hooking up this command completely is because the
> NVMe specification requires the namespace management command to be
> supported if the namespace attachment command is supported.
> 

There are many ways to support Namespace Management, and there are a lot
of quirks with each of them. Do we use a big blockdev and carve out
namespaces? Then, what are the semantics of an image resize operation?

Do we dynamically create blockdev devices? That sounds pretty nice,
but might have other quirks and the attachment is not really persistent.

I think at least the "attached" parameter should be x-prefixed, but
better, leave it out for now until we know how we want Namespace
Attachment and Management to be implemented.
Klaus Jensen Oct. 20, 2020, 8:28 a.m. UTC | #6
On Oct 19 11:17, Dmitry Fomichev wrote:
> With ZNS support in place, the majority of code in nvme_rw() has
> become read- or write-specific. Move these parts to two separate
> handlers, nvme_read() and nvme_write() to make the code more
> readable and to remove multiple is_write checks that so far existed
> in the i/o path.
> 
> This is a refactoring patch, no change in functionality.
> 

This makes a lot of sense, totally Acked, but it might be better to move
it ahead as a preparation patch? It would make the zoned patch easier on
the eye.
Klaus Jensen Oct. 20, 2020, 11:08 a.m. UTC | #7
On Oct 19 11:17, Dmitry Fomichev wrote:
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 974aea33f7..fedfad595c 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -133,6 +320,12 @@ static Property nvme_ns_props[] = {
>      DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
>      DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
>      DEFINE_PROP_BOOL("attached", NvmeNamespace, params.attached, true),
> +    DEFINE_PROP_BOOL("zoned", NvmeNamespace, params.zoned, false),

Instead of using a 'zoned' property here, can we add an 'iocs' or 'csi'
property in the namespace types patch? Then, in the future if we add
additional command sets we won't need another property (like 'kv').

> +    DEFINE_PROP_SIZE("zone_size", NvmeNamespace, params.zone_size_bs,
> +                     NVME_DEFAULT_ZONE_SIZE),
> +    DEFINE_PROP_SIZE("zone_capacity", NvmeNamespace, params.zone_cap_bs, 0),

I would like zone_size and zone_capacity to be named zoned.zsze
and zoned.zcap and to be in terms of logical blocks, like in the spec.
Putting them in a pseudo-namespace makes it clear that the options
affect the zoned command set and reduces the risk of anything clashing
with the addition of other command sets (like 'kv') in the future.

> +    DEFINE_PROP_BOOL("cross_zone_read", NvmeNamespace,
> +                     params.cross_zone_read, false),

Instead of cluttering the parameters with a bunch of these when other
zone operational characteristics are added, can we use a 'zoned.zoc'
parameter that matches the spec?

> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 93728e51b3..34d0d0250d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -3079,6 +4001,9 @@ static Property nvme_props[] = {
>      DEFINE_PROP_UINT32("aer_max_queued", NvmeCtrl, params.aer_max_queued, 64),
>      DEFINE_PROP_UINT8("mdts", NvmeCtrl, params.mdts, 7),
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
> +    DEFINE_PROP_UINT8("fill_pattern", NvmeCtrl, params.fill_pattern, 0),
> +    DEFINE_PROP_SIZE32("zone_append_size_limit", NvmeCtrl, params.zasl_bs,
> +                       NVME_DEFAULT_MAX_ZA_SIZE),

Similar to my reasoning above, I would like this to be zoned.zasl and in
terms of logical blocks like the spec. Also, I think '0' is a better
default since zero values typically identify a default value in the spec
as well.

I know this might sound like bikeshedding, but I wanna make sure that we
get the parameters right since we cannot get rid of them once they are
there. Following the definitions of the spec makes it very clear what
their meanings are and should be. 'mdts' is currently the only other
parameter like this, but that is also specified as in the spec, and not
as an absolute value.

My preference also applies to subsequent patches, like using `zoned.mor`
and `zoned.mar` for the resource limits.
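
As a concrete illustration of a spec-style parameter, 'mdts' is a
power-of-two exponent in units of the memory page size rather than a
byte count. A minimal sketch of the conversion follows; the function
name and the page_bits argument are hypothetical.

```c
#include <stdint.h>

/*
 * Convert a spec-style power-of-two limit (such as MDTS) to bytes.
 * page_bits is log2 of the memory page size the field is scaled by
 * (12 for 4 KiB pages). An exponent of 0 means no limit is reported
 * and is returned as 0. Assumes exponent + page_bits < 64.
 */
static uint64_t sketch_limit_to_bytes(uint8_t exponent, uint32_t page_bits)
{
    if (exponent == 0) {
        return 0;
    }
    return UINT64_C(1) << (exponent + page_bits);
}
```

With the current default of mdts=7 and 4 KiB pages this gives 512 KiB,
and a value of 0 stays available as the "default" marker.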
Keith Busch Oct. 20, 2020, 12:36 p.m. UTC | #8
On Tue, Oct 20, 2020 at 10:28:22AM +0200, Klaus Jensen wrote:
> On Oct 19 11:17, Dmitry Fomichev wrote:
> > With ZNS support in place, the majority of code in nvme_rw() has
> > become read- or write-specific. Move these parts to two separate
> > handlers, nvme_read() and nvme_write() to make the code more
> > readable and to remove multiple is_write checks that so far existed
> > in the i/o path.
> >
> > This is a refactoring patch, no change in functionality.
> >
>
> This makes a lot of sense, totally Acked, but it might be better to move
> it ahead as a preparation patch? It would make the zoned patch easier on
> the eye.


I agree with the suggestion.
Dmitry Fomichev Oct. 20, 2020, 11:04 p.m. UTC | #9
> -----Original Message-----
> From: Klaus Jensen <its@irrelevant.dk>
> Sent: Monday, October 19, 2020 4:16 PM
> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen
> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe
> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky
> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel
> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;
> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis
> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>
> Subject: Re: [PATCH v7 01/11] hw/block/nvme: Add Commands Supported
> and Effects log
>

> On Oct 19 11:17, Dmitry Fomichev wrote:

> > This log page becomes necessary to implement to allow checking for

> > Zone Append command support in Zoned Namespace Command Set.

> >

> > This commit adds the code to report this log page for NVM Command

> > Set only. The parts that are specific to zoned operation will be

> > added later in the series.

> >

> > All incoming admin and i/o commands are now only processed if their

> > corresponding support bits are set in this log. This provides an

> > easy way to control what commands to support and what not to

> > depending on set CC.CSS.

> >

> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>

> > ---

> >  hw/block/nvme-ns.h    |  1 +

> >  hw/block/nvme.c       | 98 +++++++++++++++++++++++++++++++++++++++--

> --

> >  hw/block/trace-events |  2 +

> >  include/block/nvme.h  | 19 +++++++++

> >  4 files changed, 111 insertions(+), 9 deletions(-)

> >

> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h

> > index 83734f4606..ea8c2f785d 100644

> > --- a/hw/block/nvme-ns.h

> > +++ b/hw/block/nvme-ns.h

> > @@ -29,6 +29,7 @@ typedef struct NvmeNamespace {

> >      int32_t      bootindex;

> >      int64_t      size;

> >      NvmeIdNs     id_ns;

> > +    const uint32_t *iocs;

> >

> >      NvmeNamespaceParams params;

> >  } NvmeNamespace;

> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c

> > index 9d30ca69dc..5a9493d89f 100644

> > --- a/hw/block/nvme.c

> > +++ b/hw/block/nvme.c

> > @@ -111,6 +111,28 @@ static const uint32_t

> nvme_feature_cap[NVME_FID_MAX] = {

> >      [NVME_TIMESTAMP]                = NVME_FEAT_CAP_CHANGE,

> >  };

> >

> > +static const uint32_t nvme_cse_acs[256] = {

> > +    [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_DELETE_CQ]        = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_CREATE_CQ]        = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_IDENTIFY]         = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,

> > +    [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,

> > +};

> 

> NVME_ADM_CMD_ABORT is missing. And since you added a (redundant)

> check

> in nvme_admin_cmd that checks this table, Abort is now an invalid

> command.


Adding ABORT, thanks. I think this code was written before Abort was
merged, which is why it is missing.

> 

> Also, can you reorder it according to opcode instead of

> pseudo-lexicographically?


OK, will move ...GET_LOG_PAGE, which is now out of order.
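
For illustration, a minimal sketch of what the table could look like once
Abort is added and the entries are sorted by opcode. The opcode values are
the ones assigned by the NVMe base specification; the enum and the
adm_cmd_supported() helper are local stand-ins defined here so that the
snippet compiles on its own, not the definitions from include/block/nvme.h.

```c
#include <assert.h>
#include <stdint.h>

/* Admin opcodes as assigned by the NVMe base specification (local copy). */
enum {
    NVME_ADM_CMD_DELETE_SQ    = 0x00,
    NVME_ADM_CMD_CREATE_SQ    = 0x01,
    NVME_ADM_CMD_GET_LOG_PAGE = 0x02,
    NVME_ADM_CMD_DELETE_CQ    = 0x04,
    NVME_ADM_CMD_CREATE_CQ    = 0x05,
    NVME_ADM_CMD_IDENTIFY     = 0x06,
    NVME_ADM_CMD_ABORT        = 0x08,
    NVME_ADM_CMD_SET_FEATURES = 0x09,
    NVME_ADM_CMD_GET_FEATURES = 0x0a,
    NVME_ADM_CMD_ASYNC_EV_REQ = 0x0c,
};

#define NVME_CMD_EFF_CSUPP (1u << 0)

/* One entry per possible opcode; unlisted opcodes stay 0 (unsupported). */
static const uint32_t nvme_cse_acs[256] = {
    [NVME_ADM_CMD_DELETE_SQ]    = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_CREATE_SQ]    = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_DELETE_CQ]    = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_CREATE_CQ]    = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_IDENTIFY]     = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_ABORT]        = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
};

static_assert(sizeof(nvme_cse_acs) == 256 * sizeof(uint32_t),
              "the table must cover every 8-bit opcode");

/* Hypothetical helper: non-zero if the admin opcode is advertised. */
static int adm_cmd_supported(uint8_t opc)
{
    return (nvme_cse_acs[opc] & NVME_CMD_EFF_CSUPP) != 0;
}
```

Because the array has 256 entries and the opcode is a uint8_t, the lookup
cannot go out of bounds, and any opcode left out of the initializer is
rejected by default.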

> 

> > +

> > +static const uint32_t nvme_cse_iocs_none[256] = {

> > +};

> 

> [-pedantic] no need for the '= {}'


OK.

> 

> > +

> > +static const uint32_t nvme_cse_iocs_nvm[256] = {

> > +    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP |

> NVME_CMD_EFF_LBCC,

> > +    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP |

> NVME_CMD_EFF_LBCC,

> > +    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP |

> NVME_CMD_EFF_LBCC,

> > +    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,

> > +};

> > +

> >  static void nvme_process_sq(void *opaque);

> >

> >  static uint16_t nvme_cid(NvmeRequest *req)

> > @@ -1032,10 +1054,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n,

> NvmeRequest *req)

> >      trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),

> >                            req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));

> >

> > -    if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {

> > -        return NVME_INVALID_OPCODE | NVME_DNR;

> > -    }

> > -

> 

> I would assume the device to respond with invalid opcode before

> validating the nsid if it is an admin only device.


The host can't make any assumptions about the ordering of validation
checks performed by the controller. In the case of receiving an i/o
command with invalid NSID when CC.CSS == ADMIN_ONLY, both
Invalid Opcode and Invalid NSID error status codes should be acceptable.

> 

> >      if (!nvme_nsid_valid(n, nsid)) {

> >          return NVME_INVALID_NSID | NVME_DNR;

> >      }

> > @@ -1045,6 +1063,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n,

> NvmeRequest *req)

> >          return NVME_INVALID_FIELD | NVME_DNR;

> >      }

> >

> > +    if (!(req->ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {

> > +        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);

> > +        return NVME_INVALID_OPCODE | NVME_DNR;

> > +    }

> > +

> >      switch (req->cmd.opcode) {

> >      case NVME_CMD_FLUSH:

> >          return nvme_flush(n, req);

> > @@ -1054,8 +1077,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n,

> NvmeRequest *req)

> >      case NVME_CMD_READ:

> >          return nvme_rw(n, req);

> >      default:

> > -        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);

> > -        return NVME_INVALID_OPCODE | NVME_DNR;

> > +        assert(false);

> >      }

> >  }

> >

> > @@ -1291,6 +1313,39 @@ static uint16_t nvme_error_info(NvmeCtrl *n,

> uint8_t rae, uint32_t buf_len,

> >                      DMA_DIRECTION_FROM_DEVICE, req);

> >  }

> >

> > +static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t buf_len,

> > +                                 uint64_t off, NvmeRequest *req)

> > +{

> > +    NvmeEffectsLog log = {};

> 

> [-pedantic] an empty initializer list is not allowed, should be '{0}'.


Could you please point me to a document where it is not allowed?
I can see around 900 occurrences of this construct in the current QEMU
C code...
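
As far as the standard goes: ISO C up to C17 requires at least one
initializer between the braces (C11 6.7.9), so '= {}' is a GNU extension
that GCC and Clang accept and that -pedantic flags; C23 makes it standard.
A minimal sketch of the portable spelling follows. The EffectsLog struct
is a local stand-in that mirrors the NvmeEffectsLog layout from this patch,
and the two helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in with the same layout as NvmeEffectsLog in the patch. */
typedef struct EffectsLog {
    uint32_t acs[256];
    uint32_t iocs[256];
    uint8_t  resv[2048];
} EffectsLog;

static_assert(sizeof(EffectsLog) == 4096, "log page must be 4 KiB");

/* '{ 0 }' initializes the first member's first element explicitly and
 * every remaining element and member implicitly to zero. */
static EffectsLog make_empty_log(void)
{
    EffectsLog log = { 0 };
    return log;
}

/* Returns 1 if every byte of the log is zero, 0 otherwise. */
static int log_is_zero(const EffectsLog *log)
{
    const uint8_t *p = (const uint8_t *)log;
    size_t i;

    for (i = 0; i < sizeof(*log); i++) {
        if (p[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

The struct has no padding, so checking it byte by byte is meaningful here;
for a struct with padding, only the named members are guaranteed to be
zero.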

> 

> > +    const uint32_t *src_iocs = NULL;

> > +    uint32_t trans_len;

> > +

> > +    trace_pci_nvme_cmd_supp_and_effects_log_read();

> 

> This has just been traced in nvme_admin_cmd and this doesn't add any

> additional info.

> 


Ok, this one is not really needed, will remove.

> > +

> > +    if (off >= sizeof(log)) {

> > +        trace_pci_nvme_err_invalid_effects_log_offset(off);

> 

> Can we do `trace_pci_nvme_err_invalid_log_page_offset(off)` instead? Then

> we can easily reuse it in the other log pages.


Will rename.

> 

> > +        return NVME_INVALID_FIELD | NVME_DNR;

> > +    }

> > +

> > +    switch (NVME_CC_CSS(n->bar.cc)) {

> > +    case NVME_CC_CSS_NVM:

> > +        src_iocs = nvme_cse_iocs_nvm;

> > +    case NVME_CC_CSS_ADMIN_ONLY:

> > +        break;

> > +    }

> > +

> > +    memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));

> > +

> > +    if (src_iocs) {

> > +        memcpy(log.iocs, src_iocs, sizeof(log.iocs));

> > +    }

> > +

> > +    trans_len = MIN(sizeof(log) - off, buf_len);

> > +

> > +    return nvme_dma(n, ((uint8_t *)&log) + off, trans_len,

> > +                    DMA_DIRECTION_FROM_DEVICE, req);

> > +}

> > +

> >  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)

> >  {

> >      NvmeCmd *cmd = &req->cmd;

> > @@ -1334,6 +1389,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n,

> NvmeRequest *req)

> >          return nvme_smart_info(n, rae, len, off, req);

> >      case NVME_LOG_FW_SLOT_INFO:

> >          return nvme_fw_log_info(n, len, off, req);

> > +    case NVME_LOG_CMD_EFFECTS:

> > +        return nvme_cmd_effects(n, len, off, req);

> >      default:

> >          trace_pci_nvme_err_invalid_log_page(nvme_cid(req), lid);

> >          return NVME_INVALID_FIELD | NVME_DNR;

> > @@ -1920,6 +1977,11 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n,

> NvmeRequest *req)

> >      trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req-

> >cmd.opcode,

> >                               nvme_adm_opc_str(req->cmd.opcode));

> >

> > +    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {

> > +        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);

> > +        return NVME_INVALID_OPCODE | NVME_DNR;

> > +    }

> > +

> 

> This is the (redundant) check that effectively makes Abort an invalid

> command.


This check is not redundant - I think it is a better alternative to checking
for CC.CSS == ADMIN_ONLY. This way, the actual set of supported
commands is always in sync with what is advertised in CSE log.
And this approach makes it easy to support i/o Command Set Specific
admin commands in the future. For that, ns->acs can be introduced along
the same lines as ns->iocs. For now, of course, this is not necessary.
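
To illustrate the idea, here is a minimal, self-contained sketch of
table-driven gating: the same table that would be copied into the CSE log
also decides whether a command is dispatched at all. The I/O opcode values
come from the NVM command set; the Namespace struct, the status codes, and
the function names are simplified stand-ins, not the QEMU definitions.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_EFF_CSUPP (1u << 0)
#define CMD_EFF_LBCC  (1u << 1)

/* NVM command set opcodes. */
enum {
    CMD_FLUSH        = 0x00,
    CMD_WRITE        = 0x01,
    CMD_READ         = 0x02,
    CMD_WRITE_ZEROES = 0x08,
};

/* Stand-in status codes for this sketch only. */
enum { STATUS_SUCCESS = 0, STATUS_INVALID_OPCODE = 1 };

static const uint32_t iocs_none[256] = { 0 };
static const uint32_t iocs_nvm[256] = {
    [CMD_FLUSH]        = CMD_EFF_CSUPP | CMD_EFF_LBCC,
    [CMD_WRITE]        = CMD_EFF_CSUPP | CMD_EFF_LBCC,
    [CMD_READ]         = CMD_EFF_CSUPP,
    [CMD_WRITE_ZEROES] = CMD_EFF_CSUPP | CMD_EFF_LBCC,
};

typedef struct Namespace {
    const uint32_t *iocs;   /* table currently in effect for this namespace */
} Namespace;

/* Chosen once at controller start, based on CC.CSS. */
static void select_iocs(Namespace *ns, int admin_only)
{
    ns->iocs = admin_only ? iocs_none : iocs_nvm;
}

static int io_cmd(const Namespace *ns, uint8_t opc)
{
    /* Single gate: anything not advertised is rejected here. */
    if (!(ns->iocs[opc] & CMD_EFF_CSUPP)) {
        return STATUS_INVALID_OPCODE;
    }

    switch (opc) {
    case CMD_FLUSH:
    case CMD_WRITE:
    case CMD_READ:
    case CMD_WRITE_ZEROES:
        return STATUS_SUCCESS;      /* real handlers would be called here */
    default:
        /* Reachable only if the table and the switch drift apart. */
        assert(0);
        return STATUS_INVALID_OPCODE;
    }
}
```

With this shape, the admin-only case needs no separate CC.CSS test in the
command path: selecting the empty table makes every I/O opcode fail the
gate. The cost is that the table and the switch must be kept in step,
which is what the assert in the default branch guards.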

> 

> >      switch (req->cmd.opcode) {

> >      case NVME_ADM_CMD_DELETE_SQ:

> >          return nvme_del_sq(n, req);

> > @@ -1942,8 +2004,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n,

> NvmeRequest *req)

> >      case NVME_ADM_CMD_ASYNC_EV_REQ:

> >          return nvme_aer(n, req);

> >      default:

> > -        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);

> > -        return NVME_INVALID_OPCODE | NVME_DNR;

> > +        assert(false);

> >      }

> >  }

> >

> > @@ -2031,6 +2092,23 @@ static void nvme_clear_ctrl(NvmeCtrl *n)

> >      n->bar.cc = 0;

> >  }

> >

> > +static void nvme_select_ns_iocs(NvmeCtrl *n)

> > +{

> > +    NvmeNamespace *ns;

> > +    int i;

> > +

> > +    for (i = 1; i <= n->num_namespaces; i++) {

> > +        ns = nvme_ns(n, i);

> > +        if (!ns) {

> > +            continue;

> > +        }

> > +        ns->iocs = nvme_cse_iocs_none;

> > +        if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {

> > +            ns->iocs = nvme_cse_iocs_nvm;

> > +        }

> > +    }

> > +}

> > +

> >  static int nvme_start_ctrl(NvmeCtrl *n)

> >  {

> >      uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;

> > @@ -2129,6 +2207,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)

> >

> >      QTAILQ_INIT(&n->aer_queue);

> >

> > +    nvme_select_ns_iocs(n);

> > +

> >      return 0;

> >  }

> >

> > @@ -2737,7 +2817,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice

> *pci_dev)

> >      id->acl = 3;

> >      id->aerl = n->params.aerl;

> >      id->frmw = (NVME_NUM_FW_SLOTS << 1) | NVME_FRMW_SLOT1_RO;

> > -    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_EXTENDED;

> > +    id->lpa = NVME_LPA_NS_SMART | NVME_LPA_CSE |

> NVME_LPA_EXTENDED;

> >

> >      /* recommended default value (~70 C) */

> >      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);

> > diff --git a/hw/block/trace-events b/hw/block/trace-events

> > index fac5995d94..0ae9cb0d35 100644

> > --- a/hw/block/trace-events

> > +++ b/hw/block/trace-events

> > @@ -85,6 +85,7 @@ pci_nvme_mmio_start_success(void) "setting

> controller enable bit succeeded"

> >  pci_nvme_mmio_stopped(void) "cleared controller enable bit"

> >  pci_nvme_mmio_shutdown_set(void) "shutdown bit set"

> >  pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"

> > +pci_nvme_cmd_supp_and_effects_log_read(void) "commands supported

> and effects log read"

> >

> >  # nvme traces for error conditions

> >  pci_nvme_err_mdts(uint16_t cid, size_t len) "cid %"PRIu16" len %zu"

> > @@ -104,6 +105,7 @@ pci_nvme_err_invalid_prp(void) "invalid PRP"

> >  pci_nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""

> >  pci_nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode

> 0x%"PRIx8""

> >  pci_nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t

> limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""

> > +pci_nvme_err_invalid_effects_log_offset(uint64_t ofs) "commands

> supported and effects log offset must be 0, got %"PRIu64""

> >  pci_nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue

> deletion, sid=%"PRIu16""

> >  pci_nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating

> submission queue, invalid cqid=%"PRIu16""

> >  pci_nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating

> submission queue, invalid sqid=%"PRIu16""

> > diff --git a/include/block/nvme.h b/include/block/nvme.h

> > index 6de2d5aa75..4779495b7d 100644

> > --- a/include/block/nvme.h

> > +++ b/include/block/nvme.h

> > @@ -744,10 +744,27 @@ enum NvmeSmartWarn {

> >      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,

> >  };

> >

> > +typedef struct NvmeEffectsLog {

> > +    uint32_t    acs[256];

> > +    uint32_t    iocs[256];

> > +    uint8_t     resv[2048];

> > +} NvmeEffectsLog;

> > +

> > +enum {

> > +    NVME_CMD_EFF_CSUPP      = 1 << 0,

> > +    NVME_CMD_EFF_LBCC       = 1 << 1,

> > +    NVME_CMD_EFF_NCC        = 1 << 2,

> > +    NVME_CMD_EFF_NIC        = 1 << 3,

> > +    NVME_CMD_EFF_CCC        = 1 << 4,

> > +    NVME_CMD_EFF_CSE_MASK   = 3 << 16,

> > +    NVME_CMD_EFF_UUID_SEL   = 1 << 19,

> > +};

> > +

> >  enum NvmeLogIdentifier {

> >      NVME_LOG_ERROR_INFO     = 0x01,

> >      NVME_LOG_SMART_INFO     = 0x02,

> >      NVME_LOG_FW_SLOT_INFO   = 0x03,

> > +    NVME_LOG_CMD_EFFECTS    = 0x05,

> >  };

> >

> >  typedef struct QEMU_PACKED NvmePSD {

> > @@ -860,6 +877,7 @@ enum NvmeIdCtrlFrmw {

> >

> >  enum NvmeIdCtrlLpa {

> >      NVME_LPA_NS_SMART = 1 << 0,

> > +    NVME_LPA_CSE      = 1 << 1,

> >      NVME_LPA_EXTENDED = 1 << 2,

> >  };

> >

> > @@ -1059,6 +1077,7 @@ static inline void _nvme_check_size(void)

> >      QEMU_BUILD_BUG_ON(sizeof(NvmeErrorLog) != 64);

> >      QEMU_BUILD_BUG_ON(sizeof(NvmeFwSlotInfoLog) != 512);

> >      QEMU_BUILD_BUG_ON(sizeof(NvmeSmartLog) != 512);

> > +    QEMU_BUILD_BUG_ON(sizeof(NvmeEffectsLog) != 4096);

> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdCtrl) != 4096);

> >      QEMU_BUILD_BUG_ON(sizeof(NvmeIdNs) != 4096);

> >      QEMU_BUILD_BUG_ON(sizeof(NvmeSglDescriptor) != 16);

> > --

> > 2.21.0

> >

> >

> 

> --

> One of us - No more doubt, silence or taboo about mental illness.
Dmitry Fomichev Oct. 20, 2020, 11:05 p.m. UTC | #10
> -----Original Message-----

> From: Keith Busch <kbusch@kernel.org>

> Sent: Tuesday, October 20, 2020 8:36 AM

> To: Klaus Jensen <its@irrelevant.dk>

> Cc: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>; Klaus Jensen

> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe

> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky

> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel

> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;

> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis

> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>

> Subject: Re: [PATCH v7 10/11] hw/block/nvme: Separate read and write

> handlers

> 

> On Tue, Oct 20, 2020 at 10:28:22AM +0200, Klaus Jensen wrote:

> > On Oct 19 11:17, Dmitry Fomichev wrote:

> > > With ZNS support in place, the majority of code in nvme_rw() has

> > > become read- or write-specific. Move these parts to two separate

> > > handlers, nvme_read() and nvme_write() to make the code more

> > > readable and to remove multiple is_write checks that so far existed

> > > in the i/o path.

> > >

> > > This is a refactoring patch, no change in functionality.

> > >

> >

> > This makes a lot of sense, totally Acked, but it might be better to move

> > it ahead as a preparation patch? It would make the zoned patch easier on

> > the eye.

> 

> I agree with the suggestion.


Ok, will move them to the front of the series.
Dmitry Fomichev Oct. 20, 2020, 11:09 p.m. UTC | #11
> -----Original Message-----

> From: Klaus Jensen <its@irrelevant.dk>

> Sent: Tuesday, October 20, 2020 4:21 AM

> To: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>

> Cc: Keith Busch <kbusch@kernel.org>; Klaus Jensen

> <k.jensen@samsung.com>; Kevin Wolf <kwolf@redhat.com>; Philippe

> Mathieu-Daudé <philmd@redhat.com>; Maxim Levitsky

> <mlevitsk@redhat.com>; Fam Zheng <fam@euphon.net>; Niklas Cassel

> <Niklas.Cassel@wdc.com>; Damien Le Moal <Damien.LeMoal@wdc.com>;

> qemu-block@nongnu.org; qemu-devel@nongnu.org; Alistair Francis

> <Alistair.Francis@wdc.com>; Matias Bjorling <Matias.Bjorling@wdc.com>

> Subject: Re: [PATCH v7 04/11] hw/block/nvme: Support allocated CNS

> command variants

> 

> On Oct 19 11:17, Dmitry Fomichev wrote:

> 

> (snip)

> 

> > CAP.CSS (together with the I/O Command Set data structure) defines

> > what command sets are supported by the controller.

> >

> > CC.CSS (together with Set Profile) can be set to enable a subset of

> > the available command sets.

> >

> > Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces

> > will still be attached (and thus marked as active).

> > Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces

> > will still be attached (and thus marked as active).

> >

> > However, any operation from a disabled command set will result in an

> > Invalid Command Opcode.

> >

> 

> This part of the commit message seems irrelevant to the patch.

> 

> > Add a new Boolean namespace property, "attached", to provide the most

> > basic namespace attachment support. The default value for this new

> > property is true. Also, implement the logic in the new CNS values to

> > include/exclude namespaces based on this new property. The only thing

> > missing is hooking up the actual Namespace Attachment command opcode,

> > which will allow a user to toggle the "attached" flag per namespace.

> >

> 

> Without Namespace Attachment support, the sole purpose of this

> parameter

> is to allow unusable namespace IDs to be reported. I have no problems

> with adding support for the additional CNS values. They will return

> identical responses, but I think that is good enough for now.

> 

> When it is not really needed, we should be wary of adding a parameter

> that is really hard to get rid of again.

> 

> > The reason for not hooking up this command completely is because the

> > NVMe specification requires the namespace management command to be

> > supported if the namespace attachment command is supported.

> >

> 

> There are many ways to support Namespace Management, and there are a

> lot

> of quirks with each of them. Do we use a big blockdev and carve out

> namespaces? Then, what are the semantics of an image resize operation?

> 

> Do we dynamically create blockdev devices - that sounds pretty nice,

> but might have other quirks and the attachment is not really persistent.

> 

> I think at least the "attached" parameter should be x-prefixed, but

> better, leave it out for now until we know how we want Namespace

> Attachment and Management to be implemented.


I don't mind leaving this property out. I used it for testing the patch and it
could, in theory, be manipulated by an external process doing NS
Management, but, as you said, there is no certainty about how NS
Management will be implemented, and any related CLI interface would be
better added as part of that future work, not now.