diff mbox series

[v6,03/12] docs: qcom: Add qualcomm minidump guide

Message ID 1700864395-1479-4-git-send-email-quic_mojha@quicinc.com
State New
Headers show
Series Add Qualcomm Minidump driver related support | expand

Commit Message

Mukesh Ojha Nov. 24, 2023, 10:19 p.m. UTC
Add the qualcomm minidump guide for the users which tries to cover
the dependency, API use and the way to test and collect minidump
on Qualcomm supported SoCs.

Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
---
 Documentation/admin-guide/index.rst         |   1 +
 Documentation/admin-guide/qcom_minidump.rst | 272 ++++++++++++++++++++++++++++
 2 files changed, 273 insertions(+)
 create mode 100644 Documentation/admin-guide/qcom_minidump.rst

Comments

Bryan O'Donoghue Nov. 28, 2023, 1:15 p.m. UTC | #1
On 24/11/2023 22:19, Mukesh Ojha wrote:
> Add the qualcomm minidump guide for the users which tries to cover
> the dependency, API use and the way to test and collect minidump
> on Qualcomm supported SoCs.
> 
> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> ---
>   Documentation/admin-guide/index.rst         |   1 +
>   Documentation/admin-guide/qcom_minidump.rst | 272 ++++++++++++++++++++++++++++
>   2 files changed, 273 insertions(+)
>   create mode 100644 Documentation/admin-guide/qcom_minidump.rst

General nit-pick on sequencing of your patches. Its good practice (tm) 
to put your documentation first in your series.

> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 43ea35613dfc..251d070486c2 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -120,6 +120,7 @@ configure specific aspects of kernel behavior to your liking.
>      perf-security
>      pm/index
>      pnp
> +   qcom_minidump
>      rapidio
>      ras
>      rtc
> diff --git a/Documentation/admin-guide/qcom_minidump.rst b/Documentation/admin-guide/qcom_minidump.rst
> new file mode 100644
> index 000000000000..b492f2b79639
> --- /dev/null
> +++ b/Documentation/admin-guide/qcom_minidump.rst
> @@ -0,0 +1,272 @@
> +Qualcomm minidump feature
> +=========================
> +
> +Introduction
> +------------
> +
> +Minidump is a best effort mechanism to collect useful and predefined
> +data for first level of debugging on end user devices running on
> +Qualcomm SoCs. 

What does "first-level debugging" mean here ? You use the term 
"post-mortem" later, stick with one term for consistency throughout your 
document.

Suggest:

"Minidump is a best-effort mechanism to collect useful predefined data 
for post-mortem debugging on a Qualcomm System on Chips (SoCs)."

Generally better to stick with "Qualcomm SoC" and "Minidump" once you 
establish the terms upfront and early in your text - instead of 
reverting to "device" and "it".

It is built on the premise that System on Chip (SoC)
> +or subsystem part of SoC crashes, due to a range of hardware and
> +software bugs.
Instead of saying "It" say "Minidump"

Hence, the ability to collect accurate data is only
> +a best-effort. The data collected could be invalid or corrupted, data
> +collection itself could fail, and so on.

"and so on" is a redundancy, drop.


Suggest:
Minidump is built on the premise that a hardware or software component 
on the SoC has encountered an unexpected fault. This means that the data 
collected by minidump cannot be assumed to be correct or even present.

> +
> +Qualcomm devices in engineering mode provides a mechanism for generating
> +full system RAM dumps for post-mortem debugging. 

Stick with the established naming convention

"Qualcomm SoCs in engineering mode provide a mechanism for generating 
full system RAM dumps for this post-mortem debugging"

But in some cases it's
> +however not feasible to capture the entire content of RAM. The minidump
> +mechanism provides the means for selected region should be included in
> +the ramdump.

Dont' start a sentence with "But"

You're also not being clear what you mean by "the entire content of RAM" 
since you obviously can't capture a full snap-shot of DRAM and store in 
DRAM.

Better to say IMO

"Minidump captures specific pre-defined regions of RAM and stores those 
regions in a reserved Minidump specific buffer."

"The region of RAM used to store Minidump data shall be referred to as 
SMEM throughout this document." [1]

> +::
> +
> +   +-----------------------------------------------+
> +   |   DDR                       +-------------+   |

Instead of saying "DDR" just mark this as RAM or Memory.
Its a terrible nit-pick from me but DDR = "Double Data Rate Synchronous 
Dynamic Random-Access Memory" but that's irrelevant to this 
specification. We could be living in a gigantic SRAM for argument sake.

> +   |                             |      SS0-ToC|   |
> +   | +----------------+     +----------------+ |   |
> +   | |Shared memory   |     |         SS1-ToC| |   |
> +   | |(SMEM)          |     |                | |   |
> +   | |                | +-->|--------+       | |   |
> +   | |G-ToC           | |   | SS-ToC  \      | |   |
> +   | |+-------------+ | |   | +-----------+  | |   |
> +   | ||-------------| | |   | |-----------|  | |   |
> +   | || SS0-ToC     | | | +-|<|SS1 region1|  | |   |
> +   | ||-------------| | | | | |-----------|  | |   |
> +   | || SS1-ToC     |-|>+ | | |SS1 region2|  | |   |
> +   | ||-------------| |   | | |-----------|  | |   |
> +   | || SS2-ToC     | |   | | |  ...      |  | |   |
> +   | ||-------------| |   | | |-----------|  | |   |
> +   | ||  ...        | |   |-|<|SS1 regionN|  | |   |
> +   | ||-------------| |   | | |-----------|  | |   |
> +   | || SSn-ToC     | |   | | +-----------+  | |   |
> +   | |+-------------+ |   | |                | |   |
> +   | |                |   | |----------------| |   |
> +   | |                |   +>|  regionN       | |   |
> +   | |                |   | |----------------| |   |
> +   | +----------------+   | |                | |   |
> +   |                      | |----------------| |   |
> +   |                      +>|  region1       | |   |
> +   |                        |----------------| |   |
> +   |                        |                | |   |
> +   |                        |----------------|-+   |
> +   |                        |  region5       |     |
> +   |                        |----------------|     |
> +   |                        |                |     |
> +   |  Region information    +----------------+     |
> +   | +---------------+                             |
> +   | |region name    |                             |
> +   | |---------------|                             |
> +   | |region address |                             |
> +   | |---------------|                             |
> +   | |region size    |                             |
> +   | +---------------+                             |
> +   +-----------------------------------------------+
> +       G-ToC: Global table of contents
> +       SS-ToC: Subsystem table of contents
> +       SS0-SSn: Subsystem numbered from 0 to n

You need to be more consistent here

SSX-ToC -> SSX-SSn

Where X is an integer from 0 upwards and similarly n is a integer from 0 
upwards.

This name SSX-SSn is not especially descriptive, I'm not sure if this is 
a name you are choosing here ? If so then let me suggest a new name like 
"Subsystem Memory Segment" SSX-MSn

->

        G-ToC: Global table of contents
        SSX-ToC: Subsystem X table of contents.
                 X is an integer in the range of 0 to ?
                 Is there an upper limit ?
                 Presumably this is an 8, 16, 32 or 64 bit integer
                 Please define either the size of the integer or the
                 valid range of values 0..128, 0..256
        SSX-MSn: Subsystem numbered from 0 to n
                 Same comment for the 'n' here.

> +
> +It depends on SoC where the underlying firmware is keeping the
> +minidump global table taking care of subsystem ToC part for
> +minidump like for above diagram, it is for shared memory sitting
> +in DDR and it is shared among various master however it is possible

> +that this could be implemented via memory mapped regions but the
> +general idea should remain same. Here, various subsystem could be
> +DSP's like ADSP/CDSP/MODEM etc, along with Application processor
> +(APSS) where Linux runs. 


DSP minidump gets collected when DSP's goes
> +for recovery followed by a crash. The minidump part of code for
> +that resides in ``qcom_rproc_minidump.c``.

This paragraph is difficult to parse.

What you are describing here is a linked list, I think you should have a 
paragraph describing how the memory structure works

->

"Minidump determines which areas of DRAM to capture via a Minidump 
defined linked-list structure.

At the top level a Global Table of Contents (GTOC) enumerates a variable 
number of SubSystem Table Of Contents (SSTOC) structures.

Each SSTOC contains a list of SubSystem Memory Segements which are named 
according to the containing SSTOC hence (SSX-MSn) where "X" denotes the 
SystemSystem index of the containing SSX-ToC and "n" denotes an 
individual Memory Segment within the SystemSystem. Hence SS0-MS0 belongs 
to SS0-ToC whereas SS1-MS0 belongs to SS1-ToC."

Then I think you can describe how the crash dump colleciton works and 
which agents of the system - DSP ? is responsible for collecting the 
crashdump

->

"The Application Processor SubSystem (APSS) runs the Linux kernel and is 
therefore not responsible for assembling Minidump data. One of the other 
system agents in the SoC will be responsible for capturing the Minidump 
data during system reset.

Typically one of the SoC Digital Signal Processors (DSP) will be used 
for this purpose.

During reset the DSP will walk the GTOC, SSX-ToCs and SSX-MSns 
populating the Minidump RAM area with the indicated memory"

> +
> +
> +SMEM as backend


> +----------------
> +
> +In this document, SMEM will be used as the backend implementation
> +of minidump.

[1] As per the above link, you need to introduce the term SMEM earlier.

It's fine to expand on its meaning later but, do please define it once 
upfront before you use it in your awesome ASCII art.

> +The core of minidump feature is part of Qualcomm's boot firmware code.
> +It initializes shared memory (SMEM), which is a part of DDR and
> +allocates a small section of it to minidump table, i.e. also called
> +global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has
> +its own table of segments to be included in the minidump, all
> +references from a descriptor in SMEM (G-ToC). Each segment/region has
> +some details like name, physical address and its size etc. and it
> +could be anywhere scattered in the DDR.

->

"The SoC's bootloader must reserve an area of RAM as SMEM prior to 
handing over control to the run-time operating system. The bootloader is 
responsible to place the GTOC at the starting address of SMEM."

If you want to give more technical details of size, physical address - 
then explicitly define those in the section above which talks about the 
linked-list structure.

Please try to avoid use of "etc" or "and so on" since it assumes the 
reader already knows how the system works and can fill in the blanks 
but, what you are doing here is educating a Minidump novice in how 
things work.

> +
> +Qualcomm APSS Minidump kernel driver concept
> +--------------------------------------------
> +
> +Qualcomm APSS minidump kernel driver adds the capability to add Linux

So why "Minidump" and then "minidump" choose one.

> +region to be dumped as part of RAM dump collection.


OK so this really is the "meat" of the system. The bootloader/firmware 
populates the GTOC.

The Q this document should probably answer is how the kernel driver 
knows how/where to place its data.

Assumed to be parsing the DTB.

  At the moment,
> +shared memory driver creates platform device for minidump driver and
> +give a means to APSS minidump to initialize itself on probe.

"At the moment" is another drop.

Just make a clear statement

"The minidump platform driver populates the APSS porition of the GTOC"

more interesting to me is - are there defined numbers, identifiers for 
the APSS ? or do we just add new entries to the GTOC ?

ie. is there a reserved index or "type" in the GTOC that identifies 
where the APSS needs to insert itself ?

> +This driver provides ``qcom_minidump_region_register`` and
> +``qcom_minidump_region_unregister`` API's to register and unregister
> +APSS minidump region. 

Why does it do that ? Is it not the case that the driver knows where the 
APSS data goes ?

It also supports registration for the clients
> +who came before minidump driver was initialized. It maintains pending
> +list of clients who came before minidump and once minidump is initialized
> +it registers them in one go.

Don't start sentences with "It" -> "The driver" or "Minidump"

As I read this though, the Minidump driver in Linux isn't just 
registering / managing the APSS side of things but also "doing stuff" 
for other clients ?

How does the Linux driver know what to register ?

> +
> +To simplify post-mortem debugging, driver creates and maintain an ELF

the driver creates and maintains

> +header as first region that gets updated each time a new region gets
> +registered.

as the first region

So - who is registering these regions ? Linux kernel drivers ? aDSP / cDSP ?

If I write a new driver for Venus or GPU can I define my own region(s) 
to be captured ?

Presumably. Please give more detail on this.

> +
> +The solution supports extracting the RAM dump/minidump produced either
> +over USB or stored to an attached storage device.

What provides that functionality ? The bootloader ?

How do you trigger / capture that dump from the bootloader ?

No need to go into super-detail but give some idea.

> +
> +Dependency of minidump kernel driver
> +------------------------------------
> +
> +It is to note that whole of minidump depends on Qualcomm boot firmware
> +whether it supports minidump or not. 

You can drop this - you've already stated the bootloader/firmware must 
setup the initial table so, you're not providing additional information 
with this statement.

> So, if the minidump SMEM ID is

Try not to start sentences with "So"

SMEM ID ? This is your first time using this term - please relate it 
back to your ASCII diagram and the description you give with that text.

> +present in shared memory, it indicates that minidump is supported from
> +boot firmware and it is possible to dump Linux (APSS) region as part
> +of minidump collection.

If _which_ SMEM ID ?

It seems to me as if we are missing some important information here - 
what are the list of SMEM IDs ?

Are these IDs serial and incrementing across SoC versions or SoC specific ?

You need to define that data.

> +How a kernel client driver can register region with minidump
> +------------------------------------------------------------

Answering yes to my earlier question. A driver I write can make use of 
the API you are providing here.

Great. Please give some indication of that earlier, even if its a 
reference to this description you give here "See X.Y later in this document"

> +
> +Client driver can use ``qcom_minidump_region_register`` API's to register
> +and ``qcom_minidump_region_unregister`` to unregister their region from
> +minidump driver.
> +
> +Client needs to fill their region by filling ``qcom_minidump_region``
> +structure object which consists of the region name, region's virtual
> +and physical address and its size.

Nit pick. You need a definite article here "A client driver" etc.

> +
> +Below, is one sample client driver snippet which tries to allocate a
> +region from kernel heap of certain size and it writes a certain known
> +pattern.

Good

  (that can help in verification after collection that we got
> +the exact pattern, what we wrote) and registers it with minidump.

Not necessary to define this. We are all smart here and by now the 
intent of the mechanism is defined..

> +
> + .. code-block:: c
> +
> +  #include <soc/qcom/qcom_minidump.h>
> +  [...]
> +
> +
> +  [... inside a function ...]
> +  struct qcom_minidump_region region;
> +
> +  [...]
> +
> +  client_mem_region = kzalloc(region_size, GFP_KERNEL);
> +  if (!client_mem_region)
> +	return -ENOMEM;
> +
> +  [... Just write a pattern ...]
> +  memset(client_mem_region, 0xAB, region_size);
> +
> +  [... Fill up the region object ...]
> +  strlcpy(region.name, "REGION_A", sizeof(region.name));
> +  region.virt_addr = client_mem_region;
> +  region.phys_addr = virt_to_phys(client_mem_region);
> +  region.size = region_size;
> +
> +  ret = qcom_minidump_region_register(&region);
> +  if (ret < 0) {
> +	pr_err("failed to add region in minidump: err: %d\n", ret);
> +	return ret;
> +  }
> +
> +  [...]
> +
> +
> +Test
> +----

Testing

> +
> +Existing Qualcomm devices already supports entire RAM dump (also called
> +full dump) by writing appropriate value to Qualcomm's top control and
> +status register (tcsr) in ``driver/firmware/qcom_scm.c`` .

"Existing Qualcomm SoCs already support dumping the entire RAM to the 
SMEM area/segment/whatever"

This is 100% counter-intuitive since SMEM lives in RAM, correct ?

Full dump means what, a full dump of the APSS RAM ? What happens if SMEM 
cannot accommodate the full APSS RAM dump ?

> +
> +SCM device Tree bindings required to support download mode
> +For example (sm8450) ::
> +
> +	/ {
> +
> +	[...]
> +
> +		firmware {
> +			scm: scm {
> +				compatible = "qcom,scm-sm8450", "qcom,scm";
> +				[... tcsr register ... ]
> +				qcom,dload-mode = <&tcsr 0x13000>;
> +
> +				[...]
> +			};
> +		};
> +
> +	[...]
> +
> +		soc: soc@0 {
> +
> +			[...]
> +
> +			tcsr: syscon@1fc0000 {
> +				compatible = "qcom,sm8450-tcsr", "syscon";
> +				reg = <0x0 0x1fc0000 0x0 0x30000>;
> +			};
> +
> +			[...]
> +		};
> +	[...]
> +
> +	};
> +
> +User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel
> +commandline to set the current download mode to minidump.

"A kernel command line parameter is provided 
``qcom_scm.download_mode="mini"`` to facilitate ... but you aren't 
telling us what "minidump" captures "the current download" ? do you mean 
the current state ?

Does the system continue to boot up if you pass 
qcom_scm.download_mode="mini ? will additional registrations to 
SMEM/Minidump work ?

What happens to the minidump data if there is a _subsequent_ real 
crashdump ?

Overwritten ?

Also what happens if SMEM runs out of space ? Say I boot with 
``qcom_scm.download_mode="mini"`` and then the system crashes - SMEM has 
a limit right ?

So the minidump gets overwritten ?

> +Similarly, ``"full"`` is passed to set the download mode to full dump
> +where entire RAM dump will be collected while setting it ``"full,mini"``
> +will collect minidump along with fulldump.

Still not super-clear what the difference between mini and full is here.

> +
> +Writing to sysfs node can also be used to set the mode to minidump::
> +
> +	echo "mini" > /sys/module/qcom_scm/parameter/download_mode
> +
> +Once the download mode is set, any kind of crash will make the device collect
> +respective dump as per set download mode.

Nice.

> +
> +Dump collection
> +---------------
> +::
> +
> +	+-----------+
> +	|           |
> +	|           |         +------+
> +	|           |         |      |
> +	|           |         +--+---+ Product(Qualcomm SoC)
> +	+-----------+             |
> +	|+++++++++++|<------------+
> +	|+++++++++++|    usb cable
> +	+-----------+
> +            x86_64 PC
> +
> +The solution supports a product running with Qualcomm SoC (where minidump)
> +is supported from the firmware) connected to x86_64 host PC running PCAT
> +tool.

It supports downloading the minidump produced from product to the
> +host PC over USB or to save the minidump to the product attached storage
> +device(UFS/eMMC/SD Card) into minidump dedicated partition.

It would be a good idea to reference this section earlier.

> +
> +By default, dumps are downloaded via USB to the attached x86_64 PC running
> +PCAT (Qualcomm tool) software. Upon download, we will see a set of binary
> +blobs starting with name ``md_*`` in PCAT configured directory in x86_64
> +machine, so for above example from the client it will be ``md_REGION_A.BIN``.
> +This binary blob depends on region content to determine whether it needs
> +external parser support to get the content of the region, so for simple
> +plain ASCII text we don't need any parsing and the content can be seen
> +just opening the binary file.
> +
> +To collect the dump to attached storage type, one needs to write appropriate
> +value to IMEM register, in that case dumps are collected in rawdump
> +partition on the product device itself.
> +
> +One needs to read the entire rawdump partition and pull out content to
> +save it onto the attached x86_64 machine over USB. Later, this rawdump
> +can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which
> +converts this into the similar binary blobs which we have got it when
> +download type was set to USB, i.e. a set of registered regions as blobs
> +and their name starts with ``md_*``.
> +
> +Replacing the ``dexter.exe`` with some open source tool can be added as future
> +scope of this document.

---
bod
Ruipeng Qi Dec. 25, 2023, 1:55 p.m. UTC | #2
<+How a kernel client driver can register region with minidump
<+------------------------------------------------------------
<+
<+Client driver can use ``qcom_minidump_region_register`` API's to register
<+and ``qcom_minidump_region_unregister`` to unregister their region from
<+minidump driver.
<+
<+Client needs to fill their region by filling ``qcom_minidump_region``
<+structure object which consists of the region name, region's virtual
<+and physical address and its size.

Hi, Mukesh, wish you a good holiday :)

I have the following idea, please help me to assess whether this can be
implemented or not. As we all know, most of the kernel objects are
allocated by the slab sub-system.I wonder if we can dump all memory
keeped by the slab sub-system? If so,  we got most of the kernel objects
which will be helpful to fix problems when we run with system issues.

How can we do this? From the description above, I think we should
register one region for each slab,  for each slab will have some pages,
and the memory between each slab is non-continuous. As we all
know, there are millions of slabs in the system, so if we dump slabs
in this way, it will introduce a heavy overhead.

I am not very familiar with qualcomm minidump, maybe my thought
is wrong. Looking forward to your reply!

Best Regards
Ruipeng
Mukesh Ojha Jan. 3, 2024, 3:27 p.m. UTC | #3
On 12/25/2023 7:25 PM, Ruipeng Qi wrote:
> <+How a kernel client driver can register region with minidump
> <+------------------------------------------------------------
> <+
> <+Client driver can use ``qcom_minidump_region_register`` API's to register
> <+and ``qcom_minidump_region_unregister`` to unregister their region from
> <+minidump driver.
> <+
> <+Client needs to fill their region by filling ``qcom_minidump_region``
> <+structure object which consists of the region name, region's virtual
> <+and physical address and its size.
> 
> Hi, Mukesh, wish you a good holiday :)

Hope you had the same..:-)

> 
> I have the following idea, please help me to assess whether this can be
> implemented or not. As we all know, most of the kernel objects are
> allocated by the slab sub-system.I wonder if we can dump all memory
> keeped by the slab sub-system? If so,  we got most of the kernel objects
> which will be helpful to fix problems when we run with system issues.
> 
> How can we do this? From the description above, I think we should
> register one region for each slab,  for each slab will have some pages,
> and the memory between each slab is non-continuous. As we all
> know, there are millions of slabs in the system, so if we dump slabs
> in this way, it will introduce a heavy overhead.
> 
> I am not very familiar with qualcomm minidump, maybe my thought
> is wrong. Looking forward to your reply!

In the current state and in simple terms, Qualcomm Minidump can not do
this, Minidump is more of a consumer driver so, what ever gets
registered with it, it can dump. Qualcomm Minidump serves bigger purpose
to dump content in any kind of crash whether it is kernel or non-kernel
like NOC errors/XPUs etc and both kernel/non-kernel entity can register 
to it, so we gets dump in any kind of system crash.

One more thing, kernel part of minidump, we are calling it APSS Minidump
has limitation of no of entries so it will be difficult to dump 
non-continuous regions after a certain number of registration ~200. However,
we do have a solution in downstream kernel for it like to create a big 
CMA buffer and register this buffer with Minidump so that whatever gets 
dumped in that buffer gets captured during crash and fill up this buffer 
and create elf during panic. I think, similar thing you are also doing 
with your OS-minidump.

I have just glanced into your implementation of OS-minidump, it
more of relying on basic concept of RAM content preserved
across boot and later reading it through procfs but this basic
stuff is common to pstore(ram) as well and pstore has file system 
support why don't you make your driver as one of pstore record and that 
way Qualcomm minidump also gets benefited where entire OS-minidump 
record gets registered with Qualcomm minidump and we get this on panic 
and you get this via pstorefs.

-Mukesh

> 
> Best Regards
> Ruipeng
Ruipeng Qi Jan. 8, 2024, 3:34 p.m. UTC | #4
On Wed, Jan 3, 2024 at 11:27 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote:
>
>
> One more thing, kernel part of minidump, we are calling it APSS Minidump
> has limitation of no of entries so it will be difficult to dump
> non-continuous regions after a certain number of registration ~200. However,
> we do have a solution in downstream kernel for it like to create a big
> CMA buffer and register this buffer with Minidump so that whatever gets
> dumped in that buffer gets captured during crash and fill up this buffer
> and create elf during panic. I think, similar thing you are also doing
> with your OS-minidump.
>
> I have just glanced into your implementation of OS-minidump, it
> more of relying on basic concept of RAM content preserved
> across boot and later reading it through procfs but this basic
> stuff is common to pstore(ram) as well and pstore has file system
> support why don't you make your driver as one of pstore record and that
> way Qualcomm minidump also gets benefited where entire OS-minidump
> record gets registered with Qualcomm minidump and we get this on panic
> and you get this via pstorefs.
>
Thanks Mukesh!It is a good suggestion to move OS-minidump forward!
By the way, I have some questions here for which I need your assistance.

Firstly,I can reimplement OS-minidump as one of the pstore records to
dump data. The resulting dump file would contain thousands of
non-contiguous memory regions, each with only the virtual address and
size recorded. As far as I know, Qualcomm's minidump can handle
several memory regions, each with a physical address and size.
This seems to be a difference, and I'm curious as to how you deal with
data dumped by OS-minidump. I would really appreciate it if you could
provide more details on your approach.

Secondly, what tools do you use to analyze the dump data, and does it
support crash tool?

Lastly, is Qualcomm minidump compatible with non-Qualcomm SoCs,
and if so, how can one use it?

Best Regards
Ruipeng Qi
Mukesh Ojha Jan. 9, 2024, 8:27 a.m. UTC | #5
On 1/8/2024 9:04 PM, Ruipeng Qi wrote:
> On Wed, Jan 3, 2024 at 11:27 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote:
>>
>>
>> One more thing, kernel part of minidump, we are calling it APSS Minidump
>> has limitation of no of entries so it will be difficult to dump
>> non-continuous regions after a certain number of registration ~200. However,
>> we do have a solution in downstream kernel for it like to create a big
>> CMA buffer and register this buffer with Minidump so that whatever gets
>> dumped in that buffer gets captured during crash and fill up this buffer
>> and create elf during panic. I think, similar thing you are also doing
>> with your OS-minidump.
>>
>> I have just glanced into your implementation of OS-minidump, it
>> more of relying on basic concept of RAM content preserved
>> across boot and later reading it through procfs but this basic
>> stuff is common to pstore(ram) as well and pstore has file system
>> support why don't you make your driver as one of pstore record and that
>> way Qualcomm minidump also gets benefited where entire OS-minidump
>> record gets registered with Qualcomm minidump and we get this on panic
>> and you get this via pstorefs.
>>
> Thanks Mukesh!It is a good suggestion to move OS-minidump forward!
> By the way, I have some questions here for which I need your assistance.
> 
> Firstly,I can reimplement OS-minidump as one of the pstore records to
> dump data. The resulting dump file would contain thousands of
> non-contiguous memory regions, each with only the virtual address and
> size recorded. As far as I know, Qualcomm's minidump can handle
> several memory regions, each with a physical address and size.
> This seems to be a difference, and I'm curious as to how you deal with
> data dumped by OS-minidump. I would really appreciate it if you could
> provide more details on your approach.

What my thought was to think your OS-minidump to be one of pstore record 
similar to existing records like console, ftrace, pmsg, dmesg etc.,
If you follow this series patch 11/12 and 12/12 is trying to get the
pstore(ram) record information and registering with minidump and here
the physical address are of the ramoops record addresses.

So, once you are capturing everything inside in a record, all minidump
has to do is get your Os-minidump record physical address and size
and register with minidump.
> 
> Secondly, what tools do you use to analyze the dump data, and does it
> support crash tool?

Currently, we are trying to capture only pstore ramoops region in text
format and not been using any tool.

Since, Qualcomm minidump is controlled from boot firmware and it can
not be used on non-Qualcomm SoCs so here minidump driver and its usecase
is limited to capture only pstore (ram)records for targets where RAM
content is not guaranteed to be preserved across boots.

So, you can think minidump as one of ramoops backend which will be
dumping all the ramoops regions/records/zones.

+---------+     +---------+   +--------+      +---------+
  | console |     | pmsg    |   | ftrace |     | dmesg   | ...Os-minidump
  +---------+     +---------+   +--------+     +---------+
        |             |             |              |
        |             |             |              |
        +------------------------------------------+
                           |
                          \ /
                   +----------------+
             (1)   |pstore frontends|
                   +----------------+
                           |
                          \ /
                  +------------------- +
             (2)  | pstore backend(ram)|
                  +--------------------+
                           |
                          \ /

                    +---------------+
             (3)    | qcom_minidump |
                    +---------------+


> 
> Lastly, is Qualcomm minidump compatible with non-Qualcomm SoCs,
> and if so, how can one use it?

I already replied it above.

-Mukesh

> 
> Best Regards
> Ruipeng Qi
diff mbox series

Patch

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 43ea35613dfc..251d070486c2 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -120,6 +120,7 @@  configure specific aspects of kernel behavior to your liking.
    perf-security
    pm/index
    pnp
+   qcom_minidump
    rapidio
    ras
    rtc
diff --git a/Documentation/admin-guide/qcom_minidump.rst b/Documentation/admin-guide/qcom_minidump.rst
new file mode 100644
index 000000000000..b492f2b79639
--- /dev/null
+++ b/Documentation/admin-guide/qcom_minidump.rst
@@ -0,0 +1,272 @@ 
+Qualcomm minidump feature
+=========================
+
+Introduction
+------------
+
+Minidump is a best effort mechanism to collect useful and predefined
+data for first level of debugging on end user devices running on
+Qualcomm SoCs. It is built on the premise that System on Chip (SoC)
+or subsystem part of SoC crashes, due to a range of hardware and
+software bugs. Hence, the ability to collect accurate data is only
+a best-effort. The data collected could be invalid or corrupted, data
+collection itself could fail, and so on.
+
+Qualcomm devices in engineering mode provides a mechanism for generating
+full system RAM dumps for post-mortem debugging. But in some cases it's
+however not feasible to capture the entire content of RAM. The minidump
+mechanism provides the means for selected region should be included in
+the ramdump.
+
+
+::
+
+   +-----------------------------------------------+
+   |   DDR                       +-------------+   |
+   |                             |      SS0-ToC|   |
+   | +----------------+     +----------------+ |   |
+   | |Shared memory   |     |         SS1-ToC| |   |
+   | |(SMEM)          |     |                | |   |
+   | |                | +-->|--------+       | |   |
+   | |G-ToC           | |   | SS-ToC  \      | |   |
+   | |+-------------+ | |   | +-----------+  | |   |
+   | ||-------------| | |   | |-----------|  | |   |
+   | || SS0-ToC     | | | +-|<|SS1 region1|  | |   |
+   | ||-------------| | | | | |-----------|  | |   |
+   | || SS1-ToC     |-|>+ | | |SS1 region2|  | |   |
+   | ||-------------| |   | | |-----------|  | |   |
+   | || SS2-ToC     | |   | | |  ...      |  | |   |
+   | ||-------------| |   | | |-----------|  | |   |
+   | ||  ...        | |   |-|<|SS1 regionN|  | |   |
+   | ||-------------| |   | | |-----------|  | |   |
+   | || SSn-ToC     | |   | | +-----------+  | |   |
+   | |+-------------+ |   | |                | |   |
+   | |                |   | |----------------| |   |
+   | |                |   +>|  regionN       | |   |
+   | |                |   | |----------------| |   |
+   | +----------------+   | |                | |   |
+   |                      | |----------------| |   |
+   |                      +>|  region1       | |   |
+   |                        |----------------| |   |
+   |                        |                | |   |
+   |                        |----------------|-+   |
+   |                        |  region5       |     |
+   |                        |----------------|     |
+   |                        |                |     |
+   |  Region information    +----------------+     |
+   | +---------------+                             |
+   | |region name    |                             |
+   | |---------------|                             |
+   | |region address |                             |
+   | |---------------|                             |
+   | |region size    |                             |
+   | +---------------+                             |
+   +-----------------------------------------------+
+       G-ToC: Global table of contents
+       SS-ToC: Subsystem table of contents
+       SS0-SSn: Subsystem numbered from 0 to n
+
+It depends on SoC where the underlying firmware is keeping the
+minidump global table taking care of subsystem ToC part for
+minidump like for above diagram, it is for shared memory sitting
+in DDR and it is shared among various master however it is possible
+that this could be implemented via memory mapped regions but the
+general idea should remain same. Here, various subsystem could be
+DSP's like ADSP/CDSP/MODEM etc, along with Application processor
+(APSS) where Linux runs. DSP minidump gets collected when DSP's goes
+for recovery followed by a crash. The minidump part of code for
+that resides in ``qcom_rproc_minidump.c``.
+
+
+SMEM as backend
+----------------
+
+In this document, SMEM will be used as the backend implementation
+of minidump.
+
+The core of minidump feature is part of Qualcomm's boot firmware code.
+It initializes shared memory (SMEM), which is a part of DDR and
+allocates a small section of it to minidump table, i.e. also called
+global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has
+its own table of segments to be included in the minidump, all
+references from a descriptor in SMEM (G-ToC). Each segment/region has
+some details like name, physical address and its size etc. and it
+could be anywhere scattered in the DDR.
+
+Qualcomm APSS Minidump kernel driver concept
+--------------------------------------------
+
+Qualcomm APSS minidump kernel driver adds the capability to add Linux
+region to be dumped as part of RAM dump collection. At the moment,
+shared memory driver creates platform device for minidump driver and
+give a means to APSS minidump to initialize itself on probe.
+
+This driver provides ``qcom_minidump_region_register`` and
+``qcom_minidump_region_unregister`` API's to register and unregister
+APSS minidump region. It also supports registration for the clients
+who came before minidump driver was initialized. It maintains pending
+list of clients who came before minidump and once minidump is initialized
+it registers them in one go.
+
+To simplify post-mortem debugging, driver creates and maintain an ELF
+header as first region that gets updated each time a new region gets
+registered.
+
+The solution supports extracting the RAM dump/minidump produced either
+over USB or stored to an attached storage device.
+
+Dependency of minidump kernel driver
+------------------------------------
+
+It is to note that whole of minidump depends on Qualcomm boot firmware
+whether it supports minidump or not. So, if the minidump SMEM ID is
+present in shared memory, it indicates that minidump is supported from
+boot firmware and it is possible to dump Linux (APSS) region as part
+of minidump collection.
+
+How a kernel client driver can register region with minidump
+------------------------------------------------------------
+
+Client driver can use ``qcom_minidump_region_register`` API's to register
+and ``qcom_minidump_region_unregister`` to unregister their region from
+minidump driver.
+
+Client needs to fill their region by filling ``qcom_minidump_region``
+structure object which consists of the region name, region's virtual
+and physical address and its size.
+
+Below, is one sample client driver snippet which tries to allocate a
+region from kernel heap of certain size and it writes a certain known
+pattern (that can help in verification after collection that we got
+the exact pattern, what we wrote) and registers it with minidump.
+
+ .. code-block:: c
+
+  #include <soc/qcom/qcom_minidump.h>
+  [...]
+
+
+  [... inside a function ...]
+  struct qcom_minidump_region region;
+
+  [...]
+
+  client_mem_region = kzalloc(region_size, GFP_KERNEL);
+  if (!client_mem_region)
+	return -ENOMEM;
+
+  [... Just write a pattern ...]
+  memset(client_mem_region, 0xAB, region_size);
+
+  [... Fill up the region object ...]
+  strlcpy(region.name, "REGION_A", sizeof(region.name));
+  region.virt_addr = client_mem_region;
+  region.phys_addr = virt_to_phys(client_mem_region);
+  region.size = region_size;
+
+  ret = qcom_minidump_region_register(&region);
+  if (ret < 0) {
+	pr_err("failed to add region in minidump: err: %d\n", ret);
+	return ret;
+  }
+
+  [...]
+
+
+Test
+----
+
+Existing Qualcomm devices already supports entire RAM dump (also called
+full dump) by writing appropriate value to Qualcomm's top control and
+status register (tcsr) in ``driver/firmware/qcom_scm.c`` .
+
+SCM device Tree bindings required to support download mode
+For example (sm8450) ::
+
+	/ {
+
+	[...]
+
+		firmware {
+			scm: scm {
+				compatible = "qcom,scm-sm8450", "qcom,scm";
+				[... tcsr register ... ]
+				qcom,dload-mode = <&tcsr 0x13000>;
+
+				[...]
+			};
+		};
+
+	[...]
+
+		soc: soc@0 {
+
+			[...]
+
+			tcsr: syscon@1fc0000 {
+				compatible = "qcom,sm8450-tcsr", "syscon";
+				reg = <0x0 0x1fc0000 0x0 0x30000>;
+			};
+
+			[...]
+		};
+	[...]
+
+	};
+
+User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel
+commandline to set the current download mode to minidump.
+Similarly, ``"full"`` is passed to set the download mode to full dump
+where entire RAM dump will be collected while setting it ``"full,mini"``
+will collect minidump along with fulldump.
+
+Writing to sysfs node can also be used to set the mode to minidump::
+
+	echo "mini" > /sys/module/qcom_scm/parameter/download_mode
+
+Once the download mode is set, any kind of crash will make the device collect
+respective dump as per set download mode.
+
+Dump collection
+---------------
+::
+
+	+-----------+
+	|           |
+	|           |         +------+
+	|           |         |      |
+	|           |         +--+---+ Product(Qualcomm SoC)
+	+-----------+             |
+	|+++++++++++|<------------+
+	|+++++++++++|    usb cable
+	+-----------+
+            x86_64 PC
+
+The solution supports a product running with Qualcomm SoC (where minidump)
+is supported from the firmware) connected to x86_64 host PC running PCAT
+tool. It supports downloading the minidump produced from product to the
+host PC over USB or to save the minidump to the product attached storage
+device(UFS/eMMC/SD Card) into minidump dedicated partition.
+
+By default, dumps are downloaded via USB to the attached x86_64 PC running
+PCAT (Qualcomm tool) software. Upon download, we will see a set of binary
+blobs starting with name ``md_*`` in PCAT configured directory in x86_64
+machine, so for above example from the client it will be ``md_REGION_A.BIN``.
+This binary blob depends on region content to determine whether it needs
+external parser support to get the content of the region, so for simple
+plain ASCII text we don't need any parsing and the content can be seen
+just opening the binary file.
+
+To collect the dump to attached storage type, one needs to write appropriate
+value to IMEM register, in that case dumps are collected in rawdump
+partition on the product device itself.
+
+One needs to read the entire rawdump partition and pull out content to
+save it onto the attached x86_64 machine over USB. Later, this rawdump
+can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which
+converts this into the similar binary blobs which we have got it when
+download type was set to USB, i.e. a set of registered regions as blobs
+and their name starts with ``md_*``.
+
+Replacing the ``dexter.exe`` with some open source tool can be added as future
+scope of this document.