Message ID: cover.1620823573.git.mchehab+huawei@kernel.org
Series: Use ASCII subset instead of UTF-8 alternate symbols
On Wed, 12 May 2021 18:07:04 +0100, David Woodhouse <dwmw2@infradead.org> wrote:

> On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote:
> > Such conversion tools - plus some text editor like LibreOffice or similar - have
> > a set of rules that turns some typed ASCII characters into UTF-8 alternatives,
> > for instance converting commas into curly commas and adding non-breakable
> > spaces. All of those are meant to produce better results when the text is
> > displayed in HTML or PDF formats.
>
> And don't we render our documentation into HTML or PDF formats?

Yes.

> Are some of those non-breaking spaces not actually *useful* for their
> intended purpose?

No.

The thing is: non-breaking spaces can cause a lot of problems.

We even had to disable Sphinx's use of non-breaking spaces for PDF
outputs, as this was causing bad LaTeX/PDF output. See commit
3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output").

The aforementioned patch disables Sphinx's default behavior of using
NON-BREAKABLE SPACE on literal blocks and strings, using this special
setting: "parsedliteralwraps=true".

When NON-BREAKABLE SPACE was used in PDF outputs, several parts of the
media uAPI docs were violating the document margins by far, causing
text to be truncated.

So, please **don't add NON-BREAKABLE SPACE**, unless you test (and keep
testing from time to time) that the output in all formats properly
supports it on different Sphinx versions.

-

Also, most of those came from conversion tools, together with other
eccentricities, like the use of the U+FEFF (BOM) character at the start
of some documents. The remaining ones seem to have come from
cut-and-paste. For instance, bibliographic references (there are a
couple of those in media) sometimes have NON-BREAKABLE SPACE. I'm
pretty sure those came from cut-and-pasting the document titles from
the original PDF documents or web pages that are referenced.

> > While it is perfectly fine to use UTF-8 characters in Linux, and specially at
> > the documentation, it is better to stick to the ASCII subset on such
> > particular case, due to a couple of reasons:
> >
> > 1. it makes life easier for tools like grep;
>
> Barely, as noted, because of things like line feeds.

You can use grep with "-z" to search for multi-line strings(*), like:

    $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f)
    Documentation/RCU/Design/Data-Structures/Data-Structures.rst

(*) Unfortunately, while "git grep" also has a "-z" flag, it seems that
    this is (currently?) broken with regard to handling multilines:

    $ git grep -Pzl 'grace period started,\s*then'
    $

> > 2. they easier to edit with the some commonly used text/source
> > code editors.
>
> That is nonsense. Any but the most broken and/or anachronistic
> environments and editors will be just fine.

Not really.

I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely
on the US-intl keyboard settings, which allow me to type "'a" for á.
However, there's no shortcut for non-Latin UTF-8 codes, as far as I know.

So, if I needed to type a curly comma in the text editors I normally
use for development (vim, nano, kate), I would need to cut-and-paste it
from somewhere[1].

[1] If I have a table with UTF-8 codes handy, I could type the code
    point manually... However, it seems that this is currently broken,
    at least on Fedora 33 (with Mate Desktop and US intl keyboard with
    dead keys).

    Here, <CTRL><SHIFT>U is not working. No idea why.
I haven't tested it for *years*, as I didn't see any reason why I would
need to type UTF-8 characters by number until we started this thread.

In practice, in the very rare cases where I need to write non-Latin
UTF-8 chars (maybe once a year or so, like when I need to use a Greek
letter or some weird symbol), chances are high that I won't remember
its UTF-8 code. So, if I need to spend time looking for a specific
symbol, after finding it, I just cut-and-paste it.

But even in the best-case scenario where I know the UTF-8 code and
<CTRL><SHIFT>U works, if I wanted to use, for instance, a curly comma,
the keystroke sequence would be:

    <CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d

That's a lot harder to type and has a higher chance of mistakenly
adding a wrong symbol than just typing:

    "some string"

Knowing that both will produce *exactly* the same output, why should I
bother doing it the hard way?

-

Now, I'm not arguing that you can't use whatever UTF-8 symbol you want
in your docs. I'm just saying that, now that the conversion is over and
a lot of documents ended up getting some UTF-8 characters by accident,
it is time for a cleanup.

Thanks,
Mauro
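For reference, the "parsedliteralwraps=true" knob mentioned above is a
Sphinx LaTeX option that is set from conf.py. A minimal sketch of such a
setting (not the exact contents of the kernel's Documentation/conf.py)
looks like:

    # Sketch of a Sphinx conf.py fragment; the real kernel settings differ.
    latex_elements = {
        # 'parsedliteralwraps=true' lets parsed-literal blocks wrap with plain
        # spaces instead of NO-BREAK SPACE characters, which otherwise produced
        # lines overflowing the page margins in the PDF output.
        'sphinxsetup': 'parsedliteralwraps=true',
    }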
On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote: > Em Wed, 12 May 2021 18:07:04 +0100 > David Woodhouse <dwmw2@infradead.org> escreveu: > > > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote: > > > Such conversion tools - plus some text editor like LibreOffice or similar - have > > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives, > > > for instance converting commas into curly commas and adding non-breakable > > > spaces. All of those are meant to produce better results when the text is > > > displayed in HTML or PDF formats. > > > > And don't we render our documentation into HTML or PDF formats? > > Yes. > > > Are > > some of those non-breaking spaces not actually *useful* for their > > intended purpose? > > No. > > The thing is: non-breaking space can cause a lot of problems. > > We even had to disable Sphinx usage of non-breaking space for > PDF outputs, as this was causing bad LaTeX/PDF outputs. > > See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output") > > The afore mentioned patch disables Sphinx default behavior of > using NON-BREAKABLE SPACE on literal blocks and strings, using this > special setting: "parsedliteralwraps=true". > > When NON-BREAKABLE SPACE were used on PDF outputs, several parts of > the media uAPI docs were violating the document margins by far, > causing texts to be truncated. > > So, please **don't add NON-BREAKABLE SPACE**, unless you test > (and keep testing it from time to time) if outputs on all > formats are properly supporting it on different Sphinx versions. And there you have a specific change with a specific fix. Nothing to do with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to do with the fact that, like *every* character in every kernel file except the *binary* files, it's representable in UTF-8. By all means fix the specific characters which are typographically wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering the documentation. > Also, most of those came from conversion tools, together with other > eccentricities, like the usage of U+FEFF (BOM) character at the > start of some documents. The remaining ones seem to came from > cut-and-paste. ... or which are just entirely redundant and gratuitous, like a BOM in an environment where all files are UTF-8 and never 16-bit encodings anyway. > > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at > > > the documentation, it is better to stick to the ASCII subset on such > > > particular case, due to a couple of reasons: > > > > > > 1. it makes life easier for tools like grep; > > > > Barely, as noted, because of things like line feeds. > > You can use grep with "-z" to seek for multi-line strings(*), Like: > > $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f) > Documentation/RCU/Design/Data-Structures/Data-Structures.rst Yeah, right. That works if you don't just use the text that you'll have seen in the HTML/PDF "grace period started, then", and if you instead craft a *regex* for it, replacing the spaces with '\s*'. Or is that [[:space:]]* if you don't want to use the experimental Perl regex feature? $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU Documentation/RCU/Design/Data-Structures/Data-Structures.rst And without '-l' it'll obviously just give you the whole file. No '-A5 -B5' to see the surroundings... it's hardly a useful thing, is it? 
> (*) Unfortunately, while "git grep" also has a "-z" flag, it > seems that this is (currently?) broken with regards of handling multilines: > > $ git grep -Pzl 'grace period started,\s*then' > $ Even better. So no, multiline grep isn't really a commonly usable feature at all. This is why we prefer to put user-visible strings on one line in C source code, even if it takes the lines over 80 characters — to allow for grep to find them. > > > 2. they easier to edit with the some commonly used text/source > > > code editors. > > > > That is nonsense. Any but the most broken and/or anachronistic > > environments and editors will be just fine. > > Not really. > > I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely > on the US-intl keyboard settings, that allow me to type as "'a" for á. > However, there's no shortcut for non-Latin UTF-codes, as far as I know. > > So, if would need to type a curly comma on the text editors I normally > use for development (vim, nano, kate), I would need to cut-and-paste > it from somewhere[1]. That's entirely irrelevant. You don't need to be able to *type* every character that you see in front of you, as long as your editor will render it correctly and perhaps let you cut/paste it as you're editing the document if you're moving things around. > [1] If I have a table with UTF-8 codes handy, I could type the UTF-8 > number manually... However, it seems that this is currently broken > at least on Fedora 33 (with Mate Desktop and US intl keyboard with > dead keys). > > Here, <CTRL><SHIFT>U is not working. No idea why. I haven't > test it for *years*, as I din't see any reason why I would > need to type UTF-8 characters by numbers until we started > this thread. Please provide the bug number for this; I'd like to track it. > But even in the best case scenario where I know the UTF-8 and > <CTRL><SHIFT>U works, if I wanted to use, for instance, a curly > comma, the keystroke sequence would be: > > <CTRL><SHIFT>U201csome string<CTRL><SHIFT>U201d > > That's a lot harder than typing and has a higher chances of > mistakenly add a wrong symbol than just typing: > > "some string" > > Knowing that both will produce *exactly* the same output, why > should I bother doing it the hard way? Nobody's asked you to do it the "hard way". That's completely irrelevant to the discussion we were having. > Now, I'm not arguing that you can't use whatever UTF-8 symbol you > want on your docs. I'm just saying that, now that the conversion > is over and a lot of documents ended getting some UTF-8 characters > by accident, it is time for a cleanup. All text documents are *full* of UTF-8 characters. If there is a file in the source code which has *any* non-UTF8, we call that a 'binary file'. Again, if you want to make specific fixes like removing non-breaking spaces and byte order marks, with specific reasons, then those make sense. But it's got very little to do with UTF-8 and how easy it is to type them. And the excuse you've put in the commit comment for your patches is utterly bogus.
> On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote: >> I do use a lot of UTF-8 here, as I type texts in Portuguese, but I rely >> on the US-intl keyboard settings, that allow me to type as "'a" for á. >> However, there's no shortcut for non-Latin UTF-codes, as far as I know. >> >> So, if would need to type a curly comma on the text editors I normally >> use for development (vim, nano, kate), I would need to cut-and-paste >> it from somewhere For anyone who doesn't know about it: X has this wonderful thing called the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “. Much more mnemonic than Unicode codepoints; and you can extend it with user-defined sequences in your ~/.XCompose file. (I assume Wayland supports all this too, but don't know the details.) On 14/05/2021 10:06, David Woodhouse wrote: > Again, if you want to make specific fixes like removing non-breaking > spaces and byte order marks, with specific reasons, then those make > sense. But it's got very little to do with UTF-8 and how easy it is to > type them. And the excuse you've put in the commit comment for your > patches is utterly bogus. +1 -ed [1] https://en.wikipedia.org/wiki/Compose_key
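To illustrate the user-defined sequences mentioned above, a ~/.XCompose
file could carry entries along these lines (a sketch only; the standard
system Compose table already provides both of these sequences):

    # ~/.XCompose sketch: keep the system-wide table, then add custom sequences.
    include "%L"

    <Multi_key> <minus> <minus> <minus> : "—"  U2014  # EM DASH
    <Multi_key> <less> <quotedbl>       : "“"  U201c  # LEFT DOUBLE QUOTATION MARK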
On Fri, 14 May 2021 12:08:36 +0100, Edward Cree <ecree.xilinx@gmail.com> wrote:

> For anyone who doesn't know about it: X has this wonderful thing called
> the Compose key[1]. For instance, type ⎄--- to get —, or ⎄<" for “.
> Much more mnemonic than Unicode codepoints; and you can extend it with
> user-defined sequences in your ~/.XCompose file.

Good tip. I haven't used compose for years, as US-intl with dead keys is
enough for 99.999% of my needs.

Btw, at least on Fedora with Mate, compose is disabled by default. It
has to be enabled first, using the same tool that allows changing the
keyboard layout[1].

Yet typing an EN DASH, for example, would be "<compose>--.", which is 4
keystrokes instead of just two ('--'). It means twice the effort ;-)

[1] KDE, GNOME, Mate, ... have different ways to enable it and to select
    which key is used as <compose>:
    https://dry.sailingissues.com/us-international-keyboard-layout.html
    https://help.ubuntu.com/community/ComposeKey

Thanks,
Mauro
Em Fri, 14 May 2021 10:06:01 +0100 David Woodhouse <dwmw2@infradead.org> escreveu: > On Fri, 2021-05-14 at 10:21 +0200, Mauro Carvalho Chehab wrote: > > Em Wed, 12 May 2021 18:07:04 +0100 > > David Woodhouse <dwmw2@infradead.org> escreveu: > > > > > On Wed, 2021-05-12 at 14:50 +0200, Mauro Carvalho Chehab wrote: > > > > Such conversion tools - plus some text editor like LibreOffice or similar - have > > > > a set of rules that turns some typed ASCII characters into UTF-8 alternatives, > > > > for instance converting commas into curly commas and adding non-breakable > > > > spaces. All of those are meant to produce better results when the text is > > > > displayed in HTML or PDF formats. > > > > > > And don't we render our documentation into HTML or PDF formats? > > > > Yes. > > > > > Are > > > some of those non-breaking spaces not actually *useful* for their > > > intended purpose? > > > > No. > > > > The thing is: non-breaking space can cause a lot of problems. > > > > We even had to disable Sphinx usage of non-breaking space for > > PDF outputs, as this was causing bad LaTeX/PDF outputs. > > > > See, commit: 3b4c963243b1 ("docs: conf.py: adjust the LaTeX document output") > > > > The afore mentioned patch disables Sphinx default behavior of > > using NON-BREAKABLE SPACE on literal blocks and strings, using this > > special setting: "parsedliteralwraps=true". > > > > When NON-BREAKABLE SPACE were used on PDF outputs, several parts of > > the media uAPI docs were violating the document margins by far, > > causing texts to be truncated. > > > > So, please **don't add NON-BREAKABLE SPACE**, unless you test > > (and keep testing it from time to time) if outputs on all > > formats are properly supporting it on different Sphinx versions. > > And there you have a specific change with a specific fix. Nothing to do > with whether NON-BREAKABLE SPACE is ∉ ASCII, and *certainly* nothing to > do with the fact that, like *every* character in every kernel file > except the *binary* files, it's representable in UTF-8. > > By all means fix the specific characters which are typographically > wrong or which, like NON-BREAKABLE SPACE, cause problems for rendering > the documentation. > > > > Also, most of those came from conversion tools, together with other > > eccentricities, like the usage of U+FEFF (BOM) character at the > > start of some documents. The remaining ones seem to came from > > cut-and-paste. > > ... or which are just entirely redundant and gratuitous, like a BOM in > an environment where all files are UTF-8 and never 16-bit encodings > anyway. Agreed. > > > > > While it is perfectly fine to use UTF-8 characters in Linux, and specially at > > > > the documentation, it is better to stick to the ASCII subset on such > > > > particular case, due to a couple of reasons: > > > > > > > > 1. it makes life easier for tools like grep; > > > > > > Barely, as noted, because of things like line feeds. > > > > You can use grep with "-z" to seek for multi-line strings(*), Like: > > > > $ grep -Pzl 'grace period started,\s*then' $(find Documentation/ -type f) > > Documentation/RCU/Design/Data-Structures/Data-Structures.rst > > Yeah, right. That works if you don't just use the text that you'll have > seen in the HTML/PDF "grace period started, then", and if you instead > craft a *regex* for it, replacing the spaces with '\s*'. Or is that > [[:space:]]* if you don't want to use the experimental Perl regex > feature? 
> $ grep -zlr 'grace[[:space:]]\+period[[:space:]]\+started,[[:space:]]\+then' Documentation/RCU
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst
>
> And without '-l' it'll obviously just give you the whole file. No '-A5
> -B5' to see the surroundings... it's hardly a useful thing, is it?
>
> > (*) Unfortunately, while "git grep" also has a "-z" flag, it
> > seems that this is (currently?) broken with regards of handling multilines:
> >
> > $ git grep -Pzl 'grace period started,\s*then'
> > $
>
> Even better. So no, multiline grep isn't really a commonly usable
> feature at all.
>
> This is why we prefer to put user-visible strings on one line in C
> source code, even if it takes the lines over 80 characters — to allow
> for grep to find them.

Makes sense, but in the case of documentation, this is a little more
complex than that.

Btw, the theme used when building html by default[1] has a search box
(written in Javascript) that may be able to find multi-line patterns,
working somewhat similarly to "git grep foo -a bar".

[1] https://github.com/readthedocs/sphinx_rtd_theme

> > [1] If I have a table with UTF-8 codes handy, I could type the UTF-8
> > number manually... However, it seems that this is currently broken
> > at least on Fedora 33 (with Mate Desktop and US intl keyboard with
> > dead keys).
> >
> > Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> > test it for *years*, as I din't see any reason why I would
> > need to type UTF-8 characters by numbers until we started
> > this thread.
>
> Please provide the bug number for this; I'd like to track it.

Just opened a BZ and added you as c/c.

> > Now, I'm not arguing that you can't use whatever UTF-8 symbol you
> > want on your docs. I'm just saying that, now that the conversion
> > is over and a lot of documents ended getting some UTF-8 characters
> > by accident, it is time for a cleanup.
>
> All text documents are *full* of UTF-8 characters. If there is a file
> in the source code which has *any* non-UTF8, we call that a 'binary
> file'.
>
> Again, if you want to make specific fixes like removing non-breaking
> spaces and byte order marks, with specific reasons, then those make
> sense. But it's got very little to do with UTF-8 and how easy it is to
> type them. And the excuse you've put in the commit comment for your
> patches is utterly bogus.

Let's take one step back, in order to return to the intent of this
UTF-8 discussion, as the discussions here are not centered on the
patches, but instead on what to do and why.

-

This discussion started originally at the linux-doc ML.

While discussing an issue where a machine's locale was not set to UTF-8
on a build VM, we discovered that some converted docs ended up with BOM
characters. Those specific changes were introduced by some of my
conversion patches, probably converted via pandoc.

So, I went ahead to check what other possible weird things were
introduced by the conversion, where several scripts and tools were used
on files that already had a different markup.

I actually checked the current UTF-8 issues, and asked people at
linux-doc to comment on which of those are valid use cases, and what
should be replaced by plain ASCII.

Basically, this is the current situation (at docs/docs-next) for the
ReST files under Documentation/, excluding translations:

1. Spaces and BOM

   - U+00a0 (' '): NO-BREAK SPACE
   - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

   Based on the discussions there and on this thread, those should be
   dropped, as BOM is useless and NO-BREAK SPACE can cause problems in
   the html/pdf output;

2. Symbols

   - U+00a9 ('©'): COPYRIGHT SIGN
   - U+00ac ('¬'): NOT SIGN
   - U+00ae ('®'): REGISTERED SIGN
   - U+00b0 ('°'): DEGREE SIGN
   - U+00b1 ('±'): PLUS-MINUS SIGN
   - U+00b2 ('²'): SUPERSCRIPT TWO
   - U+00b5 ('µ'): MICRO SIGN
   - U+03bc ('μ'): GREEK SMALL LETTER MU
   - U+00b7 ('·'): MIDDLE DOT
   - U+00bd ('½'): VULGAR FRACTION ONE HALF
   - U+2122 ('™'): TRADE MARK SIGN
   - U+2264 ('≤'): LESS-THAN OR EQUAL TO
   - U+2265 ('≥'): GREATER-THAN OR EQUAL TO
   - U+2b0d ('⬍'): UP DOWN BLACK ARROW

   Those seem OK to my eyes.

   On a side note, both MICRO SIGN and GREEK SMALL LETTER MU are used in
   several docs to represent microseconds, micro-volts and micro-ampères.
   If we write an orientation document, it probably makes sense to
   recommend using MICRO SIGN in such cases.

3. Latin

   - U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
   - U+00df ('ß'): LATIN SMALL LETTER SHARP S
   - U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
   - U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
   - U+00e6 ('æ'): LATIN SMALL LETTER AE
   - U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
   - U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
   - U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
   - U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
   - U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
   - U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
   - U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
   - U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
   - U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
   - U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
   - U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
   - U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
   - U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE

   Those should be kept as well, as they're used for non-English names.

4. arrows and box drawing symbols:

   - U+2191 ('↑'): UPWARDS ARROW
   - U+2192 ('→'): RIGHTWARDS ARROW
   - U+2193 ('↓'): DOWNWARDS ARROW
   - U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
   - U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
   - U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
   - U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT

   Also should be kept.

In summary, based on the discussions we have had so far, I suspect that
there's not much to be discussed for the above cases.

So, I'll post a v3 of this series, changing only:

   - U+00a0 (' '): NO-BREAK SPACE
   - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM)

---

Now, this specific patch series also addresses this extra case:

5. curly commas:

   - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
   - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
   - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
   - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

   IMO, those should be replaced by ASCII commas: ' and ".

   The rationale is simple:

   - most were introduced during the conversion from DocBook, markdown
     and LaTeX;
   - they don't add any extra value, as using "foo" or “foo” means the
     same thing;
   - Sphinx already uses "fancy" commas in the output.

   I guess I will put this on a separate series, as this is not a bug
   fix, but just a cleanup from the conversion work.

   I'll re-post those cleanups on a separate series, for patch per
   patch review.

---

The remaining cases are future work, outside the scope of this v2:

6. Hyphens/dashes and ellipsis

   - U+2212 ('−'): MINUS SIGN
   - U+00ad (''): SOFT HYPHEN
   - U+2010 ('‐'): HYPHEN

   Those three are used in places where a normal ASCII hyphen/minus
   should be used instead. There are even a couple of C files which use
   them instead of '-' in comments.

   IMO, these are fixes/cleanups from conversions and bad cut-and-paste.

   - U+2013 ('–'): EN DASH
   - U+2014 ('—'): EM DASH
   - U+2026 ('…'): HORIZONTAL ELLIPSIS

   Those are auto-replaced by Sphinx from "--", "---" and "...",
   respectively.

   I guess those are a matter of personal preference about whether to
   use ASCII or UTF-8. My personal preference (and Ted seems to have a
   similar opinion) is to let Sphinx do the conversion.

   For those, I intend to post a separate series, to be reviewed patch
   per patch, as this is really a matter of personal taste. We'll
   hardly reach a consensus here.

7. math symbols:

   - U+00d7 ('×'): MULTIPLICATION SIGN

   This one is used mostly to describe video resolutions, but this is a
   smaller changeset than the ones that use the letter "x".

   - U+2217 ('∗'): ASTERISK OPERATOR

   This is used only here:

       Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.

   Probably added by some conversion tool. IMO, this one should also be
   replaced by an ASCII asterisk.

   I guess I'll post a patch for the ASTERISK OPERATOR.

Thanks,
Mauro
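An inventory like the one above can be reproduced with a few lines of
Python; the following is a sketch only, under the assumption that the
ReST sources live under Documentation/ and translations are skipped, and
it is not the actual script used for this series:

    #!/usr/bin/env python3
    # Tally non-ASCII characters in the ReST files under Documentation/,
    # skipping translations, and print them by code point with Unicode names.
    import pathlib
    import unicodedata
    from collections import Counter

    counts = Counter()
    for rst in pathlib.Path("Documentation").rglob("*.rst"):
        if "translations" in rst.parts:
            continue
        for ch in rst.read_text(encoding="utf-8", errors="ignore"):
            if ord(ch) > 0x7f:
                counts[ch] += 1

    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "<unknown>")
        print(f"- U+{ord(ch):04x} ('{ch}'): {name}  ({n} occurrences)")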
On Sat, 2021-05-15 at 13:23 +0200, Mauro Carvalho Chehab wrote: > Em Sat, 15 May 2021 10:24:28 +0100 > David Woodhouse <dwmw2@infradead.org> escreveu: > > > Let's take one step back, in order to return to the intents of this > > > UTF-8, as the discussions here are not centered into the patches, but > > > instead, on what to do and why. > > > > > > This discussion started originally at linux-doc ML. > > > > > > While discussing about an issue when machine's locale was not set > > > to UTF-8 on a build VM, > > > > Stop. Stop *right* there before you go any further. > > > > The machine's locale should have *nothing* to do with anything. > > Now, you're making a lot of wrong assumptions here ;-) > > 1. I didn't report the bug. Another person reported it at linux-doc; > 2. I fully agree with you that the building system should work fine > whatever locate the machine has; > 3. Sphinx supported charset for the REST input and its output is UTF-8. OK, fine. So that's an unrelated issue really, and just happened to be what historically triggered the discussion. Let's set it aside. > > > I actually checked the current UTF-8 issues … > > > > No, these aren't "UTF-8 issues". Those are *conversion* issues, and > > … *nothing* to do with the encoding that we happen to be using. > > Yes. That's what I said. Er… I'm fairly sure you *did* call them "UTF-8 issues". Whatever. > > > > Fixing the conversion issues makes a lot of sense. Try to do it without > > making *any* mention of UTF-8 at all. > > > > > In summary, based on the discussions we have so far, I suspect that > > > there's not much to be discussed for the above cases. > > > > > > So, I'll post a v3 of this series, changing only: > > > > > > - U+00a0 (' '): NO-BREAK SPACE > > > - U+feff (''): ZERO WIDTH NO-BREAK SPACE (BOM) > > > > Ack, as long as those make *no* mention of UTF-8. Except perhaps to > > note that BOM is redundant because UTF-8 doesn't have a byteorder. > > I need to tell what UTF-8 codes are replaced, as otherwise the patch > wouldn't make much sense to reviewers, as both U+00a0 and whitespaces > are displayed the same way, and BOM is invisible. > No. Again, this is *nothing* to do with UTF-8. The encoding we choose to map between byte in the file and characters is *utterly* irrelevant here. If we were using UTF-7, UTF-16, or even (in the case of non- breaking space) one of the legacy 8-bit charsets that includes it like ISO8859-1, the issue would be precisely the same. It's about the *character* U+00A0 NO-BREAK SPACE; nothing to do with UTF-8 at all. Don't mention UTF-8. It's *irrelevant* and just shows that you can't actually bothered to stop and do any critical thinking about the matter at all. As I said, the only time that it makes sense to mention UTF-8 in this context is when talking about *why* the BOM is not needed. And even then, you could say "because we *aren't* using an encoding where endianness matters, such as UTF-16", instead of actually mentioning UTF-8. Try it ☺ > > > > > --- > > > > > > Now, this specific patch series address also this extra case: > > > > > > 5. curly commas: > > > > > > - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK > > > - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK > > > - U+201c ('“'): LEFT DOUBLE QUOTATION MARK > > > - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK > > > > > > IMO, those should be replaced by ASCII commas: ' and ". 
> > > > > > The rationale is simple: > > > > > > - most were introduced during the conversion from Docbook, > > > markdown and LaTex; > > > - they don't add any extra value, as using "foo" of “foo” means > > > the same thing; > > > - Sphinx already use "fancy" commas at the output. > > > > > > I guess I will put this on a separate series, as this is not a bug > > > fix, but just a cleanup from the conversion work. > > > > > > I'll re-post those cleanups on a separate series, for patch per patch > > > review. > > > > Makes sense. > > > > The left/right quotation marks exists to make human-readable text much > > easier to read, but the key point here is that they are redundant > > because the tooling already emits them in the *output* so they don't > > need to be in the source, yes? > > Yes. > > > As long as the tooling gets it *right* and uses them where it should, > > that seems sane enough. > > > > However, it *does* break 'grep', because if I cut/paste a snippet from > > the documentation and try to grep for it, it'll no longer match. > > > > Consistency is good, but perhaps we should actually be consistent the > > other way round and always use the left/right versions in the source > > *instead* of relying on the tooling, to make searches work better? > > You claimed to care about that, right? > > That's indeed a good point. It would be interesting to have more > opinions with that matter. > > There are a couple of things to consider: > > 1. It is (usually) trivial to discover what document produced a > certain page at the documentation. > > For instance, if you want to know where the text under this > file came from, or to grep a text from it: > > https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html > > You can click at the "View page source" button at the first line. > It will show the .rst file used to produce it: > > https://www.kernel.org/doc/html/latest/_sources/admin-guide/cgroup-v2.rst.txt > > 2. If all you want is to search for a text inside the docs, > you can click at the "Search docs" box, which is part of the > Read the Docs theme. > > 3. Kernel has several extensions for Sphinx, in order to make life > easier for Kernel developers: > > Documentation/sphinx/automarkup.py > Documentation/sphinx/cdomain.py > Documentation/sphinx/kernel_abi.py > Documentation/sphinx/kernel_feat.py > Documentation/sphinx/kernel_include.py > Documentation/sphinx/kerneldoc.py > Documentation/sphinx/kernellog.py > Documentation/sphinx/kfigure.py > Documentation/sphinx/load_config.py > Documentation/sphinx/maintainers_include.py > Documentation/sphinx/rstFlatTable.py > > Those (in particular automarkup and kerneldoc) will also dynamically > change things during ReST conversion, which may cause grep to not work. > > 5. some PDF tools like evince will match curly commas if you > type an ASCII comma on their search boxes. > > 6. Some developers prefer to only deal with the files inside the > Kernel tree. Those are very unlikely to do grep with curly aspas. > > My opinion on that matter is that we should make life easier for > developers to grep on text files, as the ones using the web interface > are already served by the search box in html format or by tools like > evince. > > So, my vote here is to keep aspas as plain ASCII. OK, but all your reasoning is about the *character* used, not the encoding. So try to do it without mentioning ASCII, and especially without mentioning UTF-8. 
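An illustrative one-liner makes the character-versus-encoding point
concrete: the same U+00A0 character maps to different byte sequences
under different encodings, so the issue is with the character itself,
not with UTF-8 (sketch, using Python's standard codecs):

    # U+00A0 NO-BREAK SPACE is one character; only its byte encoding varies.
    s = "\u00a0"
    print(s.encode("utf-8"))      # b'\xc2\xa0'
    print(s.encode("utf-16-be"))  # b'\x00\xa0'
    print(s.encode("latin-1"))    # b'\xa0'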
Your point is that the *character* is the one easily reachable on standard keyboard layouts, and the one which people are most likely to enter manually. It has *nothing* to do with charset encodings, so don't conflate is with talking about charset encodings. > > > > > > The remaining cases are future work, outside the scope of this v2: > > > > > > 6. Hyphen/Dashes and ellipsis > > > > > > - U+2212 ('−'): MINUS SIGN > > > - U+00ad (''): SOFT HYPHEN > > > - U+2010 ('‐'): HYPHEN > > > > > > Those three are used on places where a normal ASCII hyphen/minus > > > should be used instead. There are even a couple of C files which > > > use them instead of '-' on comments. > > > > > > IMO are fixes/cleanups from conversions and bad cut-and-paste. > > > > That seems to make sense. > > > > > - U+2013 ('–'): EN DASH > > > - U+2014 ('—'): EM DASH > > > - U+2026 ('…'): HORIZONTAL ELLIPSIS > > > > > > Those are auto-replaced by Sphinx from "--", "---" and "...", > > > respectively. > > > > > > I guess those are a matter of personal preference about > > > weather using ASCII or UTF-8. > > > > > > My personal preference (and Ted seems to have a similar > > > opinion) is to let Sphinx do the conversion. > > > > > > For those, I intend to post a separate series, to be > > > reviewed patch per patch, as this is really a matter > > > of personal taste. Hardly we'll reach a consensus here. > > > > > > > Again using the trigraph-like '--' and '...' instead of just using the > > plain text '—' and '…' breaks searching, because what's in the output > > doesn't match the input. Again consistency is good, but perhaps we > > should standardise on just putting these in their plain text form > > instead of the trigraphs? > > Good point. > > While I don't have any strong preferences here, there's something that > annoys me with regards to EM/EN DASH: > > With the monospaced fonts I'm using here - both at my e-mailer and > on my terminals, both EM and EN DASH are displayed look *exactly* > the same. Interesting. They definitely show differently in my terminal, and in the monospaced font in email.
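When a monospaced font renders EN DASH, EM DASH and MINUS SIGN
identically, the quickest way to tell which character is actually in a
file is to check its code point, for instance with a short Python
snippet (illustrative only):

    # Print code point and Unicode name so visually identical dashes
    # (EN DASH, EM DASH, MINUS SIGN) can be told apart.
    import unicodedata

    snippet = "– — −"   # characters pasted from a document
    for ch in snippet:
        if not ch.isspace():
            print(f"U+{ord(ch):04x}  {unicodedata.name(ch)}")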