
[v2,00/40] Use ASCII subset instead of UTF-8 alternate symbols

Message ID cover.1620823573.git.mchehab+huawei@kernel.org

Message

Mauro Carvalho Chehab May 12, 2021, 12:50 p.m. UTC
This series is basically a cleanup after all those years of converting
files to ReST.

During the conversion period, several tools like LaTeX, pandoc, DocBook
and some specially-written scripts were used in order to convert
existing documents.

Such conversion tools - plus text editors like LibreOffice or similar - have
a set of rules that turn some typed ASCII characters into UTF-8 alternatives,
for instance converting straight quotes into curly quotes and adding
non-breaking spaces. All of those are meant to produce better results when
the text is displayed in HTML or PDF formats.

While it is perfectly fine to use UTF-8 characters in Linux, and especially in
the documentation, it is better to stick to the ASCII subset in this
particular case, for a couple of reasons:

1. it makes life easier for tools like grep;
2. it makes the files easier to edit with some commonly used text/source
   code editors.
    
Also, Sphinx already does such conversion automatically outside
literal blocks, as described at:

       https://docutils.sourceforge.io/docs/user/smartquotes.html

In this series, the following UTF-8 symbols are replaced:

            - U+00a0 (' '): NO-BREAK SPACE
            - U+00ad ('­'): SOFT HYPHEN
            - U+00b4 ('´'): ACUTE ACCENT
            - U+00d7 ('×'): MULTIPLICATION SIGN
            - U+2010 ('‐'): HYPHEN
            - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
            - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
            - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
            - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
            - U+2212 ('−'): MINUS SIGN
            - U+2217 ('∗'): ASTERISK OPERATOR
            - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
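
For anyone who wants to check their own files, here is a minimal Python
sketch that reports the code points listed above and replaces them. The
ASCII replacements in the table are my assumption for illustration; they
are not taken from the patches, and the function names are hypothetical.

```python
import unicodedata

# Assumed ASCII replacements for the code points listed above.
# The mapping is illustrative only; the actual patches may differ.
ASCII_REPLACEMENTS = {
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u00ad": "",    # SOFT HYPHEN
    "\u00b4": "'",   # ACUTE ACCENT
    "\u00d7": "x",   # MULTIPLICATION SIGN
    "\u2010": "-",   # HYPHEN
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2212": "-",   # MINUS SIGN
    "\u2217": "*",   # ASTERISK OPERATOR
    "\ufeff": "",    # ZERO WIDTH NO-BREAK SPACE (BOM)
}


def find_symbols(text):
    """Return (line number, code point, Unicode name) for each listed symbol."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch in ASCII_REPLACEMENTS:
                hits.append((lineno, f"U+{ord(ch):04x}", unicodedata.name(ch)))
    return hits


def to_ascii_subset(text):
    """Replace every listed symbol with its assumed ASCII alternative."""
    return text.translate(str.maketrans(ASCII_REPLACEMENTS))


if __name__ == "__main__":
    sample = "\ufeffIt\u2019s 3\u00a0\u00d7\u00a04\n\u201cok\u201d\n"
    for hit in find_symbols(sample):
        print(*hit)
    print(to_ascii_subset(sample), end="")
```

A blind replacement like this is only a starting point: each hit still
needs a human look, since some of those symbols may be intentional.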

---

v2:
- removed EM/EN DASH conversion from this patchset;
- removed a few fixes, as those were addressed on a separate series.
 
PS.:
   The first version of this series was posted with a different name:

	https://lore.kernel.org/lkml/cover.1620641727.git.mchehab+huawei@kernel.org/

   I also changed the patch texts, in order to better describe the patches' goals.

Mauro Carvalho Chehab (40):
  docs: hwmon: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: Use ASCII subset instead of UTF-8 alternate symbols
  docs: admin-guide: media: ipu3.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: admin-guide: perf: imx-ddr.rst: Use ASCII subset instead of
    UTF-8 alternate symbols
  docs: admin-guide: pm: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: trace: coresight: coresight-etm4x-reference.rst: Use ASCII
    subset instead of UTF-8 alternate symbols
  docs: driver-api: ioctl.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: driver-api: thermal: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: driver-api: media: drivers: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: driver-api: firmware: other_interfaces.rst: Use ASCII subset
    instead of UTF-8 alternate symbols
  docs: fault-injection: nvme-fault-injection.rst: Use ASCII subset
    instead of UTF-8 alternate symbols
  docs: usb: Use ASCII subset instead of UTF-8 alternate symbols
  docs: process: code-of-conduct.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: userspace-api: media: fdl-appendix.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: userspace-api: media: v4l: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: userspace-api: media: dvb: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: vm: zswap.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: filesystems: f2fs.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: filesystems: ext4: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: kernel-hacking: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: hid: Use ASCII subset instead of UTF-8 alternate symbols
  docs: security: tpm: tpm_event_log.rst: Use ASCII subset instead of
    UTF-8 alternate symbols
  docs: security: keys: trusted-encrypted.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: networking: scaling.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: networking: devlink: devlink-dpipe.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: networking: device_drivers: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: x86: Use ASCII subset instead of UTF-8 alternate symbols
  docs: scheduler: sched-deadline.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: power: powercap: powercap.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: ABI: Use ASCII subset instead of UTF-8 alternate symbols
  docs: PCI: acpi-info.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: gpu: Use ASCII subset instead of UTF-8 alternate symbols
  docs: sound: kernel-api: writing-an-alsa-driver.rst: Use ASCII subset
    instead of UTF-8 alternate symbols
  docs: arm64: arm-acpi.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: infiniband: tag_matching.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: misc-devices: ibmvmc.rst: Use ASCII subset instead of UTF-8
    alternate symbols
  docs: firmware-guide: acpi: lpit.rst: Use ASCII subset instead of
    UTF-8 alternate symbols
  docs: firmware-guide: acpi: dsd: graph.rst: Use ASCII subset instead
    of UTF-8 alternate symbols
  docs: virt: kvm: api.rst: Use ASCII subset instead of UTF-8 alternate
    symbols
  docs: RCU: Use ASCII subset instead of UTF-8 alternate symbols

 ...sfs-class-chromeos-driver-cros-ec-lightbar |   2 +-
 .../ABI/testing/sysfs-devices-platform-ipmi   |   2 +-
 .../testing/sysfs-devices-platform-trackpoint |   2 +-
 Documentation/ABI/testing/sysfs-devices-soc   |   4 +-
 Documentation/PCI/acpi-info.rst               |  22 +-
 .../Data-Structures/Data-Structures.rst       |  52 ++--
 .../Expedited-Grace-Periods.rst               |  40 +--
 .../Tree-RCU-Memory-Ordering.rst              |  10 +-
 .../RCU/Design/Requirements/Requirements.rst  | 122 ++++-----
 Documentation/admin-guide/media/ipu3.rst      |   2 +-
 Documentation/admin-guide/perf/imx-ddr.rst    |   2 +-
 Documentation/admin-guide/pm/intel_idle.rst   |   4 +-
 Documentation/admin-guide/pm/intel_pstate.rst |   4 +-
 Documentation/admin-guide/ras.rst             |  86 +++---
 .../admin-guide/reporting-issues.rst          |   2 +-
 Documentation/arm64/arm-acpi.rst              |   8 +-
 .../driver-api/firmware/other_interfaces.rst  |   2 +-
 Documentation/driver-api/ioctl.rst            |   8 +-
 .../media/drivers/sh_mobile_ceu_camera.rst    |   8 +-
 .../driver-api/media/drivers/zoran.rst        |   2 +-
 .../driver-api/thermal/cpu-idle-cooling.rst   |  14 +-
 .../driver-api/thermal/intel_powerclamp.rst   |   6 +-
 .../thermal/x86_pkg_temperature_thermal.rst   |   2 +-
 .../fault-injection/nvme-fault-injection.rst  |   2 +-
 Documentation/filesystems/ext4/attributes.rst |  20 +-
 Documentation/filesystems/ext4/bigalloc.rst   |   6 +-
 Documentation/filesystems/ext4/blockgroup.rst |   8 +-
 Documentation/filesystems/ext4/blocks.rst     |   2 +-
 Documentation/filesystems/ext4/directory.rst  |  16 +-
 Documentation/filesystems/ext4/eainode.rst    |   2 +-
 Documentation/filesystems/ext4/inlinedata.rst |   6 +-
 Documentation/filesystems/ext4/inodes.rst     |   6 +-
 Documentation/filesystems/ext4/journal.rst    |   8 +-
 Documentation/filesystems/ext4/mmp.rst        |   2 +-
 .../filesystems/ext4/special_inodes.rst       |   4 +-
 Documentation/filesystems/ext4/super.rst      |  10 +-
 Documentation/filesystems/f2fs.rst            |   4 +-
 .../firmware-guide/acpi/dsd/graph.rst         |   2 +-
 Documentation/firmware-guide/acpi/lpit.rst    |   2 +-
 Documentation/gpu/i915.rst                    |   2 +-
 Documentation/gpu/komeda-kms.rst              |   2 +-
 Documentation/hid/hid-sensor.rst              |  70 ++---
 Documentation/hid/intel-ish-hid.rst           | 246 +++++++++---------
 Documentation/hwmon/ir36021.rst               |   2 +-
 Documentation/hwmon/ltc2992.rst               |   2 +-
 Documentation/hwmon/pm6764tr.rst              |   2 +-
 Documentation/infiniband/tag_matching.rst     |   4 +-
 Documentation/kernel-hacking/hacking.rst      |   2 +-
 Documentation/kernel-hacking/locking.rst      |   2 +-
 Documentation/misc-devices/ibmvmc.rst         |   8 +-
 .../device_drivers/ethernet/intel/i40e.rst    |   8 +-
 .../device_drivers/ethernet/intel/iavf.rst    |   4 +-
 .../device_drivers/ethernet/netronome/nfp.rst |  12 +-
 .../networking/devlink/devlink-dpipe.rst      |   2 +-
 Documentation/networking/scaling.rst          |  18 +-
 Documentation/power/powercap/powercap.rst     | 210 +++++++--------
 Documentation/process/code-of-conduct.rst     |   2 +-
 Documentation/scheduler/sched-deadline.rst    |   2 +-
 .../security/keys/trusted-encrypted.rst       |   4 +-
 Documentation/security/tpm/tpm_event_log.rst  |   2 +-
 .../kernel-api/writing-an-alsa-driver.rst     |  68 ++---
 .../coresight/coresight-etm4x-reference.rst   |  16 +-
 Documentation/usb/ehci.rst                    |   2 +-
 Documentation/usb/gadget_printer.rst          |   2 +-
 Documentation/usb/mass-storage.rst            |  36 +--
 .../media/dvb/audio-set-bypass-mode.rst       |   2 +-
 .../userspace-api/media/dvb/audio.rst         |   2 +-
 .../userspace-api/media/dvb/dmx-fopen.rst     |   2 +-
 .../userspace-api/media/dvb/dmx-fread.rst     |   2 +-
 .../media/dvb/dmx-set-filter.rst              |   2 +-
 .../userspace-api/media/dvb/intro.rst         |   6 +-
 .../userspace-api/media/dvb/video.rst         |   2 +-
 .../userspace-api/media/fdl-appendix.rst      |  64 ++---
 .../userspace-api/media/v4l/crop.rst          |  16 +-
 .../userspace-api/media/v4l/dev-decoder.rst   |   6 +-
 .../userspace-api/media/v4l/diff-v4l.rst      |   2 +-
 .../userspace-api/media/v4l/open.rst          |   2 +-
 .../media/v4l/vidioc-cropcap.rst              |   4 +-
 Documentation/virt/kvm/api.rst                |  28 +-
 Documentation/vm/zswap.rst                    |   4 +-
 Documentation/x86/resctrl.rst                 |   2 +-
 Documentation/x86/sgx.rst                     |   4 +-
 82 files changed, 693 insertions(+), 693 deletions(-)

Comments

Mauro Carvalho Chehab May 14, 2021, 8:21 a.m. UTC | #1
Em Wed, 12 May 2021 18:07:04 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > for instance converting commas into curly commas and adding non-breakable
> > spaces. All of those are meant to produce better results when the text is
> > displayed in HTML or PDF formats.  
> 
> And don't we render our documentation into HTML or PDF formats? 

Yes.

> Are
> some of those non-breaking spaces not actually *useful* for their
> intended purpose?

No.

The thing is: non-breaking space can cause a lot of problems.

We even had to disable Sphinx usage of non-breaking space for
PDF outputs, as this was causing bad LaTeX/PDF outputs.

See commit 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output").

The aforementioned patch disables the Sphinx default behavior of
using NON-BREAKABLE SPACE in literal blocks and strings, using this
special setting: "parsedliteralwraps=true".
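
As a rough illustration of where such a setting lives: Sphinx reads
LaTeX style options from the 'sphinxsetup' key of latex_elements in
conf.py. The fragment below is only a sketch showing that one option;
the real conf.py touched by that commit sets other options too, which
are not reproduced here.

```python
# Fragment of a Sphinx conf.py (a sketch; every other setting is omitted).
latex_elements = {
    # 'sphinxsetup' takes comma-separated key=value options that are
    # handed to the Sphinx LaTeX style.
    "sphinxsetup": "parsedliteralwraps=true",
}
```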

When NON-BREAKABLE SPACE was used in PDF outputs, several parts of
the media uAPI docs were violating the document margins by far,
causing text to be truncated.

So, please **don't add NON-BREAKABLE SPACE**, unless you test
(and keep testing from time to time) that the output in all
formats properly supports it on different Sphinx versions.

-

Also, most of those came from conversion tools, together with other
eccentricities, like the usage of the U+FEFF (BOM) character at the
start of some documents. The remaining ones seem to have come from
cut-and-paste.

For instance, bibliographic references (there are a couple of
those in media) sometimes have NON-BREAKABLE SPACE. I'm pretty
sure that those came from cut-and-pasting the document titles
from the original PDF documents or web pages that
are referenced.

> > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > the documentation,  it is better to  stick to the ASCII subset  on such
> > particular case,  due to a couple of reasons:
> > 
> > 1. it makes life easier for tools like grep;  
> 
> Barely, as noted, because of things like line feeds.

You can use grep with "-z" to search for multi-line strings(*), like:

	$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
	Documentation/RCU/Design/Data-Structures/Data-Structures.rst

(*) Unfortunately, while "git grep" also has a "-z" flag, it
    seems that this is (currently?) broken with regard to handling multi-line matches:

	$ git grep -Pzl 'grace period started,\s*then'
	$
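
When grep is not enough, the same kind of search can be sketched in a
few lines of Python. This is only an illustration, not an existing
tool: find_phrase() is a hypothetical helper that lets any run of
whitespace in the file (line feeds included) stand for a space in the
phrase, and it reports line numbers plus optional context lines.

```python
import re


def find_phrase(text, phrase, context=0):
    """Find `phrase` in `text`, matching any whitespace run (including
    newlines) where the phrase has a space.

    Returns a list of (first_line, last_line, snippet) tuples, where the
    snippet includes `context` extra lines before and after the match.
    """
    pattern = r"\s+".join(re.escape(word) for word in phrase.split())
    lines = text.split("\n")
    results = []
    for match in re.finditer(pattern, text):
        # Line numbers are 1-based: count the newlines before each offset.
        first = text.count("\n", 0, match.start()) + 1
        last = text.count("\n", 0, match.end()) + 1
        lo = max(first - context, 1)
        hi = min(last + context, len(lines))
        results.append((first, last, "\n".join(lines[lo - 1:hi])))
    return results


if __name__ == "__main__":
    doc = "intro\nthe grace period started,\nthen it ended\noutro\n"
    for first, last, snippet in find_phrase(doc, "grace period started, then"):
        print(f"lines {first}-{last}:")
        print(snippet)
```

The phrase can then be typed as it appears in the rendered output, and
the context argument plays the role that -A/-B play for single-line grep.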

> > 2. they easier to edit with the some commonly used text/source
> >    code editors.  
> 
> That is nonsense. Any but the most broken and/or anachronistic
> environments and editors will be just fine.

Not really.

I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
on the US-intl keyboard settings, which allow me to type "'a" for á.
However, there's no shortcut for non-Latin UTF-8 characters, as far as I know.

So, if I needed to type a curly quote in the text editors I normally
use for development (vim, nano, kate), I would need to cut-and-paste
it from somewhere[1].

[1] If I have a table with Unicode code points handy, I could type the
    number manually... However, it seems that this is currently broken
    at least on Fedora 33 (with Mate Desktop and US intl keyboard with
    dead keys).

    Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
    tested it for *years*, as I didn't see any reason why I would
    need to type UTF-8 characters by number until we started
    this thread.
 
In practice, in the very rare cases where I need to write
non-Latin UTF-8 chars (maybe once a year or so, like when I
need to use a Greek letter or some weird symbol), the chances
are high that I won't remember the code point.

So, if I need to spend time searching for a specific symbol, after
finding it, I just cut-and-paste it.

But even in the best-case scenario where I know the code point and
<CTRL><SHIFT>U works, if I wanted to use, for instance, curly
quotes, the keystroke sequence would be:

	<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d

That's a lot harder to type, and has a higher chance of
mistakenly adding a wrong symbol, than just typing:

	"some string"

Knowing that both will produce *exactly* the same output, why
should I bother doing it the hard way?

-

Now, I'm not arguing that you can't use whatever UTF-8 symbol you
want in your docs. I'm just saying that, now that the conversion
is over and a lot of documents ended up getting some UTF-8 characters
by accident, it is time for a cleanup.

Thanks,
Mauro
David Woodhouse May 14, 2021, 9:06 a.m. UTC | #2
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 12 May 2021 18:07:04 +0100
> David Woodhouse <dwmw2@infradead.org> escreveu:
> 
> > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > > Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > for instance converting commas into curly commas and adding non-breakable
> > > spaces. All of those are meant to produce better results when the text is
> > > displayed in HTML or PDF formats.  
> > 
> > And don't we render our documentation into HTML or PDF formats? 
> 
> Yes.
> 
> > Are
> > some of those non-breaking spaces not actually *useful* for their
> > intended purpose?
> 
> No.
> 
> The thing is: non-breaking space can cause a lot of problems.
> 
> We even had to disable Sphinx usage of non-breaking space for
> PDF outputs, as this was causing bad LaTeX/PDF outputs.
> 
> See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
> 
> The afore mentioned patch disables Sphinx default behavior of
> using NON-BREAKABLE SPACE on literal blocks and strings, using this
> special setting: "parsedliteralwraps=true".
> 
> When NON-BREAKABLE SPACE were used on PDF outputs, several parts of 
> the media uAPI docs were violating the document margins by far,
> causing texts to be truncated.
> 
> So, please **don't add NON-BREAKABLE SPACE**, unless you test
> (and keep testing it from time to time) if outputs on all
> formats are properly supporting it on different Sphinx versions.

And there you have a specific change with a specific fix. Nothing to do
with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
do with the fact that, like *every* character in every kernel file
except the *binary* files, it's representable in UTF-8.

By all means fix the specific characters which are typographically
wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
the documentation.


> Also, most of those came from conversion tools, together with other
> eccentricities, like the usage of U+FEFF (BOM) character at the
> start of some documents. The remaining ones seem to came from 
> cut-and-paste.

... or which are just entirely redundant and gratuitous, like a BOM in
an environment where all files are UTF-8 and never 16-bit encodings
anyway.

> > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > the documentation,  it is better to  stick to the ASCII subset  on such
> > > particular case,  due to a couple of reasons:
> > > 
> > > 1. it makes life easier for tools like grep;  
> > 
> > Barely, as noted, because of things like line feeds.
> 
> You can use grep with "-z" to seek for multi-line strings(*), Like:
> 
> 	$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> 	Documentation/RCU/Design/Data-Structures/Data-Structures.rst

Yeah, right. That works if you don't just use the text that you'll have
seen in the HTML/PDF "grace period started, then", and if you instead
craft a *regex* for it, replacing the spaces with '\s*'. Or is that
[[:space:]]* if you don't want to use the experimental Perl regex
feature?

 $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
Documentation/RCU/Design/Data-Structures/Data-Structures.rst

And without '-l' it'll obviously just give you the whole file. No '-A5
-B5' to see the surroundings... it's hardly a useful thing, is it?

> (*) Unfortunately, while "git grep" also has a "-z" flag, it
>     seems that this is (currently?) broken with regards of handling multilines:
> 
> 	$ git grep -Pzl 'grace period started,\s*then'
> 	$

Even better. So no, multiline grep isn't really a commonly usable
feature at all.

This is why we prefer to put user-visible strings on one line in C
source code, even if it takes the lines over 80 characters — to allow
for grep to find them.

> > > 2. they easier to edit with the some commonly used text/source
> > >    code editors.  
> > 
> > That is nonsense. Any but the most broken and/or anachronistic
> > environments and editors will be just fine.
> 
> Not really.
> 
> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
> on the US-intl keyboard settings, that allow me to type as "'a" for á.
> However, there's no shortcut for non-Latin UTF-codes, as far as I know.
> 
> So, if would need to type a curly comma on the text editors I normally 
> use for development (vim, nano, kate), I would need to cut-and-paste
> it from somewhere[1].

That's entirely irrelevant. You don't need to be able to *type* every
character that you see in front of you, as long as your editor will
render it correctly and perhaps let you cut/paste it as you're editing
the document if you're moving things around.

> [1] If I have a table with UTF-8 codes handy, I could type the UTF-8 
>     number manually... However, it seems that this is currently broken 
>     at least on Fedora 33 (with Mate Desktop and US intl keyboard with 
>     dead keys).
> 
>     Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
>     test it for *years*, as I din't see any reason why I would
>     need to type UTF-8 characters by numbers until we started
>     this thread.

Please provide the bug number for this; I'd like to track it.

> But even in the best case scenario where I know the UTF-8 and
> <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly
> comma, the keystroke sequence would be:
> 
> 	<CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d
> 
> That's a lot harder than typing and has a higher chances of
> mistakenly add a wrong symbol than just typing:
> 
> 	"some string"
> 
> Knowing that both will produce *exactly* the same output, why
> should I bother doing it the hard way?

Nobody's asked you to do it the "hard way". That's completely
irrelevant to the discussion we were having.

> Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> want on your docs. I'm just saying that, now that the conversion 
> is over and a lot of documents ended getting some UTF-8 characters
> by accident, it is time for a cleanup.

All text documents are *full* of UTF-8 characters. If there is a file
in the source code which has *any* non-UTF8, we call that a 'binary
file'.

Again, if you want to make specific fixes like removing non-breaking
spaces and byte order marks, with specific reasons, then those make
sense. But it's got very little to do with UTF-8 and how easy it is to
type them. And the excuse you've put in the commit comment for your
patches is utterly bogus.
Edward Cree May 14, 2021, 11:08 a.m. UTC | #3
> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
>> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
>> on the US-intl keyboard settings, that allow me to type as "'a" for á.
>> However, there's no shortcut for non-Latin UTF-codes, as far as I know.
>>
>> So, if would need to type a curly comma on the text editors I normally 
>> use for development (vim, nano, kate), I would need to cut-and-paste
>> it from somewhere

For anyone who doesn't know about it: X has this wonderful thing called
 the Compose key[1].  For instance, type ⎄--- to get —, or ⎄<" for “.
Much more mnemonic than Unicode codepoints; and you can extend it with
 user-defined sequences in your ~/.XCompose file.
(I assume Wayland supports all this too, but don't know the details.)

On 14/05/2021 10:06, David Woodhouse wrote:
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

+1

-ed

[1] https://en.wikipedia.org/wiki/Compose_key
Mauro Carvalho Chehab May 14, 2021, 2:18 p.m. UTC | #4
Em Fri, 14 May 2021 12:08:36 +0100
Edward Cree <ecree.xilinx@gmail.com> escreveu:

> For anyone who doesn't know about it: X has this wonderful thing called
>  the Compose key[1].  For instance, type ⎄--- to get —, or ⎄<" for “.
> Much more mnemonic than Unicode codepoints; and you can extend it with
>  user-defined sequences in your ~/.XCompose file.

Good tip. I haven't used Compose for years, as US-intl with dead keys is
enough for 99.999% of my needs.

Btw, at least on Fedora with Mate, Compose is disabled by default. It has
to be enabled first using the same tool that allows changing the keyboard
layout[1].

Yet, typing an EN DASH, for example, would be "<compose>--.", which is 4
keystrokes instead of just two ('--'). It means twice the effort ;-)

[1] KDE, GNOME, Mate, ... have different ways to enable it and to
    select which key is considered <compose>:

	https://dry.sailingissues.com/us-international-keyboard-layout.html
	https://help.ubuntu.com/community/ComposeKey

Thanks,
Mauro
Mauro Carvalho Chehab May 15, 2021, 8:22 a.m. UTC | #5
Em Fri, 14 May 2021 10:06:01 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote:
> > Em Wed, 12 May 2021 18:07:04 +0100
> > David Woodhouse <dwmw2@infradead.org> escreveu:
> >   
> > > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:  
> > > > Such conversion tools - plus some text editor like LibreOffice  or similar  - have
> > > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > > > for instance converting commas into curly commas and adding non-breakable
> > > > spaces. All of those are meant to produce better results when the text is
> > > > displayed in HTML or PDF formats.    
> > > 
> > > And don't we render our documentation into HTML or PDF formats?   
> > 
> > Yes.
> >   
> > > Are
> > > some of those non-breaking spaces not actually *useful* for their
> > > intended purpose?  
> > 
> > No.
> > 
> > The thing is: non-breaking space can cause a lot of problems.
> > 
> > We even had to disable Sphinx usage of non-breaking space for
> > PDF outputs, as this was causing bad LaTeX/PDF outputs.
> > 
> > See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output")
> > 
> > The afore mentioned patch disables Sphinx default behavior of
> > using NON-BREAKABLE SPACE on literal blocks and strings, using this
> > special setting: "parsedliteralwraps=true".
> > 
> > When NON-BREAKABLE SPACE were used on PDF outputs, several parts of 
> > the media uAPI docs were violating the document margins by far,
> > causing texts to be truncated.
> > 
> > So, please **don't add NON-BREAKABLE SPACE**, unless you test
> > (and keep testing it from time to time) if outputs on all
> > formats are properly supporting it on different Sphinx versions.  
> 
> And there you have a specific change with a specific fix. Nothing to do
> with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to
> do with the fact that, like *every* character in every kernel file
> except the *binary* files, it's representable in UTF-8.
> 
> By all means fix the specific characters which are typographically
> wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering
> the documentation.
> 
> 
> > Also, most of those came from conversion tools, together with other
> > eccentricities, like the usage of U+FEFF (BOM) character at the
> > start of some documents. The remaining ones seem to came from 
> > cut-and-paste.  
> 
> ... or which are just entirely redundant and gratuitous, like a BOM in
> an environment where all files are UTF-8 and never 16-bit encodings
> anyway.

Agreed.

> 
> > > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > > > the documentation,  it is better to  stick to the ASCII subset  on such
> > > > particular case,  due to a couple of reasons:
> > > > 
> > > > 1. it makes life easier for tools like grep;    
> > > 
> > > Barely, as noted, because of things like line feeds.  
> > 
> > You can use grep with "-z" to seek for multi-line strings(*), Like:
> > 
> > 	$ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
> > 	Documentation/RCU/Design/Data-Structures/Data-Structures.rst  
> 
> Yeah, right. That works if you don't just use the text that you'll have
> seen in the HTML/PDF "grace period started, then", and if you instead
> craft a *regex* for it, replacing the spaces with '\s*'. Or is that
> [[:space:]]* if you don't want to use the experimental Perl regex
> feature?
> 
>  $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst
> 
> And without '-l' it'll obviously just give you the whole file. No '-A5
> -B5' to see the surroundings... it's hardly a useful thing, is it?
> 
> > (*) Unfortunately, while "git grep" also has a "-z" flag, it
> >     seems that this is (currently?) broken with regards of handling multilines:
> > 
> > 	$ git grep -Pzl 'grace period started,\s*then'
> > 	$  
> 
> Even better. So no, multiline grep isn't really a commonly usable
> feature at all.
> 
> This is why we prefer to put user-visible strings on one line in C
> source code, even if it takes the lines over 80 characters — to allow
> for grep to find them.

Makes sense, but in case of documentation, this is a little more
complex than that. 

Btw, the theme used by default when building html[1] has a search
box (written in JavaScript) that should be able to find multi-line
patterns, working somewhat similarly to "git grep -e foo --and -e bar".

[1] https://github.com/readthedocs/sphinx_rtd_theme

> > [1] If I have a table with UTF-8 codes handy, I could type the UTF-8 
> >     number manually... However, it seems that this is currently broken 
> >     at least on Fedora 33 (with Mate Desktop and US intl keyboard with 
> >     dead keys).
> > 
> >     Here, <CTRL><SHIFT>U is not working. No idea why. I haven't 
> >     test it for *years*, as I din't see any reason why I would
> >     need to type UTF-8 characters by numbers until we started
> >     this thread.  
> 
> Please provide the bug number for this; I'd like to track it.

Just opened a BZ and added you in Cc.

> > Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> > want on your docs. I'm just saying that, now that the conversion 
> > is over and a lot of documents ended getting some UTF-8 characters
> > by accident, it is time for a cleanup.  
> 
> All text documents are *full* of UTF-8 characters. If there is a file
> in the source code which has *any* non-UTF8, we call that a 'binary
> file'.
> 
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

Let's take one step back, in order to return to the intent of this
UTF-8 cleanup, as the discussion here is not centered on the patches, but
instead on what to do and why.

-

This discussion originally started on the linux-doc ML.

While discussing an issue where the machine's locale was not set
to UTF-8 on a build VM, we discovered that some converted docs ended
up with BOM characters. Those specific changes were introduced by some
of my conversion patches, probably converted via pandoc.

So, I went ahead and checked what other possible weird things
were introduced by the conversion, where several scripts and tools
were used on files that already had a different markup.

I actually checked the current UTF-8 issues, and asked people on
linux-doc to comment on which of those are valid use cases, and which
should be replaced by plain ASCII.

Basically, this is the current situation (at docs/docs-next) for the
ReST files under Documentation/, excluding translations:

1. Spaces and BOM

	- U+00a0 (' '): NO-BREAK SPACE
	- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

Based on the discussions there and on this thread, those should be
dropped, as BOM is useless and NO-BREAK SPACE can cause problems
in the html/pdf output;

2. Symbols

	- U+00a9 ('©'): COPYRIGHT SIGN
	- U+00ac ('¬'): NOT SIGN
	- U+00ae ('®'): REGISTERED SIGN
	- U+00b0 ('°'): DEGREE SIGN
	- U+00b1 ('±'): PLUS-MINUS SIGN
	- U+00b2 ('²'): SUPERSCRIPT TWO
	- U+00b5 ('µ'): MICRO SIGN
	- U+03bc ('μ'): GREEK SMALL LETTER MU
	- U+00b7 ('·'): MIDDLE DOT
	- U+00bd ('½'): VULGAR FRACTION ONE HALF
	- U+2122 ('™'): TRADE MARK SIGN
	- U+2264 ('≤'): LESS-THAN OR EQUAL TO
	- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
	- U+2b0d ('⬍'): UP DOWN BLACK ARROW

Those seem OK to my eyes.

On a side note, both MICRO SIGN and GREEK SMALL LETTER MU are
used in several docs to represent microseconds, micro-volts and
micro-ampères. If we write an orientation document, it probably
makes sense to recommend using MICRO SIGN in such cases.
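
To make the difference between the two characters concrete, here is a
small sketch, assuming Python 3. The helper name use_micro_sign is
hypothetical. Note that a blind replacement like this is only safe on
text that contains no real Greek words, and that Unicode NFKC
normalization goes in the opposite direction (it maps MICRO SIGN to
GREEK SMALL LETTER MU).

```python
import unicodedata

MICRO_SIGN = "\u00b5"       # MICRO SIGN
GREEK_SMALL_MU = "\u03bc"   # GREEK SMALL LETTER MU


def use_micro_sign(text: str) -> str:
    """Replace every GREEK SMALL LETTER MU by MICRO SIGN."""
    return text.replace(GREEK_SMALL_MU, MICRO_SIGN)


# The two characters look alike but are distinct code points.
assert MICRO_SIGN != GREEK_SMALL_MU

# NFKC normalization folds MICRO SIGN into the Greek letter, so a
# tool that normalizes text would undo the recommendation above.
assert unicodedata.normalize("NFKC", MICRO_SIGN) == GREEK_SMALL_MU
```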

3. Latin

	- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
	- U+00df ('ß'): LATIN SMALL LETTER SHARP S
	- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
	- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
	- U+00e6 ('æ'): LATIN SMALL LETTER AE
	- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
	- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
	- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
	- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
	- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
	- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
	- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
	- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
	- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
	- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
	- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
	- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
	- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE

Those should be kept as well, as they're used for non-English names.

4. arrows and box drawing symbols:
	- U+2191 ('↑'): UPWARDS ARROW
	- U+2192 ('→'): RIGHTWARDS ARROW
	- U+2193 ('↓'): DOWNWARDS ARROW

	- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
	- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
	- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
	- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT

Those should be kept too.

In summary, based on the discussions we have so far, I suspect that
there's not much to be discussed for the above cases.

So, I'll post a v3 of this series, changing only:

	- U+00a0 (' '): NO-BREAK SPACE
	- U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)
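
As an illustration of what such a change amounts to, here is a
minimal sketch, assuming Python 3. The names clean_spaces and
clean_file are hypothetical, and replacing every NO-BREAK SPACE by a
plain space is an assumption that would still need per-file review.

```python
from pathlib import Path

NO_BREAK_SPACE = "\u00a0"
BOM = "\ufeff"


def clean_spaces(text: str) -> str:
    """Drop BOM characters and turn NO-BREAK SPACE into a plain space."""
    return text.replace(BOM, "").replace(NO_BREAK_SPACE, " ")


def clean_file(path: Path) -> bool:
    """Rewrite path in place; return True if anything changed."""
    original = path.read_text(encoding="utf-8")
    cleaned = clean_spaces(original)
    if cleaned != original:
        path.write_text(cleaned, encoding="utf-8")
        return True
    return False
```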

---

Now, this specific patch series also addresses this extra case:

5. curly quotes:

	- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
	- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

IMO, those should be replaced by ASCII quotes: ' and ".

The rationale is simple: 

- most were introduced during the conversion from DocBook,
  Markdown and LaTeX;
- they don't add any extra value, as using "foo" or “foo” means
  the same thing;
- Sphinx already uses "fancy" quotes in the output. 

I guess I will put this in a separate series, as this is not a bug
fix, but just a cleanup from the conversion work. I'll re-post those
cleanups there, for patch per patch review.

---

The remaining cases are future work, outside the scope of this v2:

6. Hyphen/Dashes and ellipsis

	- U+2212 ('−'): MINUS SIGN
	- U+00ad ('­'): SOFT HYPHEN
	- U+2010 ('‐'): HYPHEN

	    Those three are used in places where a normal ASCII hyphen/minus
	    should be used instead. There are even a couple of C files which
	    use them instead of '-' in comments.

	    IMO, those are fixes/cleanups for conversions and bad cut-and-paste.

	- U+2013 ('–'): EN DASH
	- U+2014 ('—'): EM DASH
	- U+2026 ('…'): HORIZONTAL ELLIPSIS

	    Those are auto-replaced by Sphinx from "--", "---" and "...",
	    respectively.

	    I guess those are a matter of personal preference about
	    whether to use ASCII or UTF-8.

	    My personal preference (and Ted seems to have a similar
	    opinion) is to let Sphinx do the conversion.

	    For those, I intend to post a separate series, to be
	    reviewed patch per patch, as this is really a matter
	    of personal taste. We'll hardly reach a consensus here.
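
A toy illustration of the substitution described above, assuming
Python 3. This is not how Sphinx or docutils implement it (the
function name toy_smart_punctuation is hypothetical, and the real
code also handles quotes and skips literal blocks); it only shows
what the "--", "---" and "..." mapping does to a line, and why text
copied from the output no longer matches the source.

```python
def toy_smart_punctuation(text: str) -> str:
    """Map "---", "..." and "--" to EM DASH, ELLIPSIS and EN DASH."""
    # "---" must be handled before "--", otherwise it would be
    # consumed as an EN DASH followed by a stray hyphen.
    replacements = (
        ("---", "\u2014"),  # EM DASH
        ("...", "\u2026"),  # HORIZONTAL ELLIPSIS
        ("--", "\u2013"),   # EN DASH
    )
    for ascii_form, typographic in replacements:
        text = text.replace(ascii_form, typographic)
    return text


source = "pages 1--3 --- see below..."
rendered = toy_smart_punctuation(source)

# A snippet copied from the rendered text is not found in the source.
assert "1\u20133" in rendered
assert "1\u20133" not in source
```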

7. math symbols:

	- U+00d7 ('×'): MULTIPLICATION SIGN

	   This one is used mostly to describe video resolutions, but it
	   shows up in fewer places than the letter "x" used the same way.

	- U+2217 ('∗'): ASTERISK OPERATOR

	   This is used only here:
		Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

	   Probably added by some conversion tool. IMO, this one should
	   also be replaced by an ASCII asterisk.
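
A short sketch of both the replacement and the arithmetic in that
line, assuming Python 3. The variable names are illustrative only;
the sample string is the line quoted above.

```python
ASTERISK_OPERATOR = "\u2217"

line = "filesystem size to 2^21 \u2217 2^27 = 2^48bytes or 256TiB."
fixed = line.replace(ASTERISK_OPERATOR, "*")

# The arithmetic itself: 2^21 * 2^27 = 2^48 bytes, and one TiB is
# 2^40 bytes, so the result is 2^8 = 256 TiB.
TIB = 2**40
total_bytes = 2**21 * 2**27
assert total_bytes == 2**48
assert total_bytes // TIB == 256
```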

I guess I'll post a patch for the ASTERISK OPERATOR.
Thanks,
Mauro
David Woodhouse May 15, 2021, 12:02 p.m. UTC | #6
On Sat, 2021-05-15 at 13:23 +0200, Mauro Carvalho Chehab wrote:
> Em Sat, 15 May 2021 10:24:28 +0100
> David Woodhouse <dwmw2@infradead.org> escreveu:
> > > Let's take one step back and return to the intent of this UTF-8
> > > series, as the discussion here is not centered on the patches but
> > > rather on what to do and why.
> > > 
> > > This discussion started originally at linux-doc ML.
> > > 
> > > While discussing an issue where the machine's locale was not set
> > > to UTF-8 on a build VM,   
> > 
> > Stop. Stop *right* there before you go any further.
> > 
> > The machine's locale should have *nothing* to do with anything.
>
> Now, you're making a lot of wrong assumptions here ;-)
> 
> 1. I didn't report the bug. Another person reported it at linux-doc;
> 2. I fully agree with you that the build system should work fine
>    whatever locale the machine has;
> 3. The charset Sphinx supports for ReST input and for its output is UTF-8.

OK, fine. So that's an unrelated issue really, and just happened to be
what historically triggered the discussion. Let's set it aside.

> > > I actually checked the current UTF-8 issues … 
> > 
> > No, these aren't "UTF-8 issues". Those are *conversion* issues, and 
> > … *nothing* to do with the encoding that we happen to be using.
> 
> Yes. That's what I said.

Er… I'm fairly sure you *did* call them "UTF-8 issues". Whatever.




> > 
> > Fixing the conversion issues makes a lot of sense. Try to do it without
> > making *any* mention of UTF-8 at all.
> > 
> > > In summary, based on the discussions we have so far, I suspect that
> > > there's not much to be discussed for the above cases.
> > > 
> > > So, I'll post a v3 of this series, changing only:
> > > 
> > >         - U+00a0 (' '): NO-BREAK SPACE
> > >         - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)  
> > 
> > Ack, as long as those make *no* mention of UTF-8. Except perhaps to
> > note that BOM is redundant because UTF-8 doesn't have a byteorder.
> 
> I need to tell what UTF-8 codes are replaced, as otherwise the patch
> wouldn't make much sense to reviewers, as both U+00a0 and whitespaces
> are displayed the same way, and BOM is invisible.
> 

No. Again, this is *nothing* to do with UTF-8. The encoding we choose
to map between bytes in the file and characters is *utterly* irrelevant
here. If we were using UTF-7, UTF-16, or even (in the case of
non-breaking space) one of the legacy 8-bit charsets that include it,
like ISO8859-1, the issue would be precisely the same. 
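
A minimal sketch of that point, assuming Python 3 and its built-in
codecs: the same character, U+00A0, survives a round trip through
each of those encodings, and only the bytes on disk differ.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE

encodings = ["utf-8", "utf-16-le", "utf-7", "latin-1"]
encoded = {enc: NBSP.encode(enc) for enc in encodings}

# Whatever the byte representation, decoding gives back the very
# same character.
for enc, raw in encoded.items():
    assert raw.decode(enc) == NBSP

# The byte sequences are what changes from one encoding to the next.
assert encoded["utf-8"] == b"\xc2\xa0"
assert encoded["utf-16-le"] == b"\xa0\x00"
assert encoded["latin-1"] == b"\xa0"
```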

It's about the *character* U+00A0 NO-BREAK SPACE; nothing to do with
UTF-8 at all. Don't mention UTF-8. It's *irrelevant* and just shows
that you can't actually be bothered to stop and do any critical
thinking about the matter at all.

As I said, the only time that it makes sense to mention UTF-8 in this
context is when talking about *why* the BOM is not needed. And even
then, you could say "because we *aren't* using an encoding where
endianness matters, such as UTF-16", instead of actually mentioning
UTF-8. Try it ☺
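
A small sketch of why endianness matters for UTF-16 but not here,
assuming Python 3 and its standard codecs module. The sample strings
are arbitrary.

```python
import codecs

BOM = "\ufeff"

# In UTF-16 the same character has two possible byte orders, which
# is what the byte order mark exists to disambiguate.
assert "a".encode("utf-16-le") == b"a\x00"
assert "a".encode("utf-16-be") == b"\x00a"
assert BOM.encode("utf-16-le") == codecs.BOM_UTF16_LE
assert BOM.encode("utf-16-be") == codecs.BOM_UTF16_BE

# In UTF-8 there is only one byte order, so a leading U+FEFF carries
# no information; it is just three extra bytes in the file.
assert BOM.encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

raw = codecs.BOM_UTF8 + b"Title"
with_bom = raw.decode("utf-8")         # keeps U+FEFF as a character
without_bom = raw.decode("utf-8-sig")  # strips a leading BOM
```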

> > 
> > > ---
> > > 
> > > Now, this specific patch series address also this extra case:
> > > 
> > > 5. curly quotes:
> > > 
> > >         - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> > >         - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> > >         - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> > >         - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
> > > 
> > > IMO, those should be replaced by ASCII quotes: ' and ".
> > > 
> > > The rationale is simple: 
> > > 
> > > - most were introduced during the conversion from DocBook,
> > >   Markdown and LaTeX;
> > > - they don't add any extra value, as using "foo" or “foo” means
> > >   the same thing;
> > > - Sphinx already uses "fancy" quotes in the output. 
> > > 
> > > I guess I will put this on a separate series, as this is not a bug
> > > fix, but just a cleanup from the conversion work.
> > > 
> > > I'll re-post those cleanups on a separate series, for patch per patch
> > > review.  
> > 
> > Makes sense. 
> > 
> > The left/right quotation marks exist to make human-readable text much
> > easier to read, but the key point here is that they are redundant
> > because the tooling already emits them in the *output* so they don't
> > need to be in the source, yes?
> 
> Yes.
> 
> > As long as the tooling gets it *right* and uses them where it should,
> > that seems sane enough.
> > 
> > However, it *does* break 'grep', because if I cut/paste a snippet from
> > the documentation and try to grep for it, it'll no longer match.
> > 
> > Consistency is good, but perhaps we should actually be consistent the
> > other way round and always use the left/right versions in the source
> > *instead* of relying on the tooling, to make searches work better?
> > You claimed to care about that, right?
> 
> That's indeed a good point. It would be interesting to have more
> opinions on that matter.
> 
> There are a couple of things to consider:
> 
> 1. It is (usually) trivial to discover which source file produced a
>    given page of the documentation.
> 
>    For instance, if you want to know where the text under this
>    file came from, or to grep a text from it:
> 
> 	https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
> 
>    You can click on the "View page source" button at the first line.
>    It will show the .rst file used to produce it:
> 
> 	https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.txt
> 
> 2. If all you want is to search for a text inside the docs,
>    you can use the "Search docs" box, which is part of the
>    Read the Docs theme.
> 
> 3. Kernel has several extensions for Sphinx, in order to make life 
>    easier for Kernel developers:
> 
> 	Documentation/sphinx/automarkup.py
> 	Documentation/sphinx/cdomain.py
> 	Documentation/sphinx/kernel_abi.py
> 	Documentation/sphinx/kernel_feat.py
> 	Documentation/sphinx/kernel_include.py
> 	Documentation/sphinx/kerneldoc.py
> 	Documentation/sphinx/kernellog.py
> 	Documentation/sphinx/kfigure.py
> 	Documentation/sphinx/load_config.py
> 	Documentation/sphinx/maintainers_include.py
> 	Documentation/sphinx/rstFlatTable.py
> 
> Those (in particular automarkup and kerneldoc) will also dynamically 
> change things during ReST conversion, which may cause grep to not work. 
> 
> 5. some PDF tools like evince will match curly quotes if you
>    type an ASCII quote in their search boxes.
> 
> 6. Some developers prefer to only deal with the files inside the
>    Kernel tree. Those are very unlikely to grep for curly quotes.
> 
> My opinion on that matter is that we should make life easier for
> developers to grep on text files, as the ones using the web interface
> are already served by the search box in html format or by tools like
> evince.
> 
> So, my vote here is to keep quotes as plain ASCII.

OK, but all your reasoning is about the *character* used, not the
encoding. So try to do it without mentioning ASCII, and especially
without mentioning UTF-8.

Your point is that the *character* is the one easily reachable on
standard keyboard layouts, and the one which people are most likely to
enter manually. It has *nothing* to do with charset encodings, so don't
conflate it with talking about charset encodings.

> 
> > 
> > > The remaining cases are future work, outside the scope of this v2:
> > > 
> > > 6. Hyphen/Dashes and ellipsis
> > > 
> > >         - U+2212 ('−'): MINUS SIGN
> > >         - U+00ad ('­'): SOFT HYPHEN
> > >         - U+2010 ('‐'): HYPHEN
> > > 
> > >             Those three are used in places where a normal ASCII hyphen/minus
> > >             should be used instead. There are even a couple of C files which
> > >             use them instead of '-' in comments.
> > > 
> > >             IMO, those are fixes/cleanups for conversions and bad cut-and-paste.  
> > 
> > That seems to make sense.
> > 
> > >         - U+2013 ('–'): EN DASH
> > >         - U+2014 ('—'): EM DASH
> > >         - U+2026 ('…'): HORIZONTAL ELLIPSIS
> > > 
> > >             Those are auto-replaced by Sphinx from "--", "---" and "...",
> > >             respectively.
> > > 
> > >             I guess those are a matter of personal preference about
> > >             whether to use ASCII or UTF-8.
> > > 
> > >             My personal preference (and Ted seems to have a similar
> > >             opinion) is to let Sphinx do the conversion.
> > > 
> > >             For those, I intend to post a separate series, to be
> > >             reviewed patch per patch, as this is really a matter
> > >             of personal taste. We'll hardly reach a consensus here.
> > >   
> > 
> > Again using the trigraph-like '--' and '...' instead of just using the
> > plain text '—' and '…' breaks searching, because what's in the output
> > doesn't match the input. Again consistency is good, but perhaps we
> > should standardise on just putting these in their plain text form
> > instead of the trigraphs?
> 
> Good point. 
> 
> While I don't have any strong preferences here, there's something that
> annoys me with regards to EM/EN DASH:
> 
> With the monospaced fonts I'm using here - both in my e-mailer and
> on my terminals - EM and EN DASH look *exactly* the same.

Interesting. They definitely show differently in my terminal, and in
the monospaced font in email.