Message ID: cover.1620641727.git.mchehab+huawei@kernel.org
Series: Get rid of UTF-8 chars that can be mapped as ASCII
On 10.05.21 12:26, Mauro Carvalho Chehab wrote:
>
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset, better to use UTF-8 only on cases where it is really
> needed.
> […]
> The remaining patches on series address such cases on *.rst files and
> inside the Documentation/ABI, using this perl map table in order to do the
> charset conversion:
>
> my %char_map = (
> […]
>     0x2013 => '-',	# EN DASH
>     0x2014 => '-',	# EM DASH

I might be performing bike shedding here, but wouldn't it be better to
replace those two with "--", as explained in
https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens

For EM DASH there seems to be even "---", but I'd say that is a bit too
much.

Or do you fear the extra work, as some lines might then break the
80-character limit?

Ciao, Thorsten
Em Mon, 10 May 2021 12:52:44 +0200
Thorsten Leemhuis <linux@leemhuis.info> escreveu:

> On 10.05.21 12:26, Mauro Carvalho Chehab wrote:
> >
> > As Linux developers are all around the globe, and not everybody has UTF-8
> > as their default charset, better to use UTF-8 only on cases where it is really
> > needed.
> > […]
> > The remaining patches on series address such cases on *.rst files and
> > inside the Documentation/ABI, using this perl map table in order to do the
> > charset conversion:
> >
> > my %char_map = (
> > […]
> >     0x2013 => '-',	# EN DASH
> >     0x2014 => '-',	# EM DASH
>
> I might be performing bike shedding here, but wouldn't it be better to
> replace those two with "--", as explained in
> https://en.wikipedia.org/wiki/Dash#Approximating_the_em_dash_with_two_or_three_hyphens
>
> For EM DASH there seems to be even "---", but I'd say that is a bit too
> much.

Yeah, we can do, instead:

	0x2013 => '--',		# EN DASH
	0x2014 => '---',	# EM DASH

I was actually in doubt about those ;-)

Btw, when producing HTML documentation, Sphinx should convert:

	--  into EN DASH
	--- into EM DASH

So, the resulting html will be identical.

> Or do you fear the extra work as some lines then might break the
> 80-character limit then?

No, I suspect that the line size won't be an issue. Some care should be
taken when EN DASH and EM DASH are used inside tables.

Thanks,
Mauro
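The perl %char_map is only excerpted in the thread; as a rough illustration, the same replacement pass can be sketched in Python. The entries below are assumptions pieced together from the characters discussed in this thread, not the series' actual table:

```python
# Hypothetical sketch of the conversion table discussed above,
# translated from the (excerpted) perl %char_map into Python.
CHAR_MAP = {
    "\u2013": "--",   # EN DASH: Sphinx turns "--" back into an en dash
    "\u2014": "---",  # EM DASH: Sphinx turns "---" back into an em dash
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u00a0": " ",    # NO-BREAK SPACE
    "\ufeff": "",     # ZERO WIDTH NO-BREAK SPACE (BOM): drop entirely
}

def asciify(text: str) -> str:
    """Replace each mapped UTF-8 character with its ASCII spelling."""
    for utf8_char, ascii_repl in CHAR_MAP.items():
        text = text.replace(utf8_char, ascii_repl)
    return text
```

Since "---" and "--" round-trip through Sphinx's smartquotes pass back into em/en dashes, the rendered HTML would be unchanged by such a mapping.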
On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> The main point on this series is to replace just the occurrences
> where ASCII represents the symbol equally well
>
> - U+2014 ('—'): EM DASH

Em dash is not the same thing as hyphen-minus, and the latter does not
serve 'equally well'. People use em dashes because — even in monospace
fonts — they make text easier to read and comprehend, when used
correctly.

I accept that some of the other distinctions — like en dashes — are
needlessly pedantic (though I don't doubt there is someone out there
who will gladly defend them with the same fervour with which I argue
for the em dash) and I wouldn't take the trouble to use them myself;
but I think there is a reasonable assumption that when someone goes to
the effort of using a Unicode punctuation mark that is semantic
(rather than merely typographical), they probably had a reason for
doing so.

> - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
> - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
> - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
> - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK

(These are purely typographic, I have no problem with dumping them.)

> - U+00d7 ('×'): MULTIPLICATION SIGN

Presumably this is appearing in mathematical formulae, in which case
changing it to 'x' loses semantic information.

> Using the above symbols will just trick tools like grep for no good
> reason.

NBSP, sure. That one's probably an artefact of some document format
conversion somewhere along the line, anyway. But what kinds of things
with × or — in are going to be grept for?

If there are em dashes lying around that semantically _should_ be
hyphen-minus (one of your patches I've seen, for instance, fixes an
*en* dash moonlighting as the option character in an `ethtool` command
line), then sure, convert them. But any time someone is using a
Unicode character to *express semantics*, even if you happen to think
the semantic distinction involved is a pedantic or unimportant one, I
think you need an explicit grep case to justify ASCIIfying it.

-ed
On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> This patch series is doing conversion only when using ASCII makes
> more sense than using UTF-8.
>
> See, a number of converted documents ended with weird characters
> like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> character doesn't do any good.
>
> Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> someone tries to use grep[1].

Replacing those makes sense. But replacing em dashes — which are a
distinct character that has no direct replacement in ASCII and which
people do *deliberately* use instead of hyphen-minus — does not.

Perhaps stick to those two, and any cases where an em dash or en dash
has been used where U+002D HYPHEN-MINUS *should* have been used.

And please fix your cover letter, which made no reference to 'grep'
and only presented a completely bogus argument for the change instead.
On Mon, May 10, 2021 at 02:16:16PM +0100, Edward Cree wrote:
> On 10/05/2021 12:55, Mauro Carvalho Chehab wrote:
> > The main point on this series is to replace just the occurrences
> > where ASCII represents the symbol equally well
> >
> > - U+2014 ('—'): EM DASH
>
> Em dash is not the same thing as hyphen-minus, and the latter does not
> serve 'equally well'. People use em dashes because — even in
> monospace fonts — they make text easier to read and comprehend, when
> used correctly.
>
> I accept that some of the other distinctions — like en dashes — are
> needlessly pedantic (though I don't doubt there is someone out there
> who will gladly defend them with the same fervour with which I argue
> for the em dash) and I wouldn't take the trouble to use them myself;
> but I think there is a reasonable assumption that when someone goes
> to the effort of using a Unicode punctuation mark that is semantic
> (rather than merely typographical), they probably had a reason for
> doing so.

I think you're overestimating the amount of care and typographical
knowledge that your average kernel developer has. Most of these UTF-8
characters come from latex conversions and really aren't necessary
(and are being used incorrectly).

You seem quite knowledgeable about the various differences. Perhaps
you'd be willing to write a document for Documentation/doc-guide/
that provides guidance for when to use which kinds of horizontal
line?

https://www.punctuationmatters.com/hyphen-dash-n-dash-and-m-dash/
talks about it in the context of publications, but I think we need
something more suited to our needs for kernel documentation.
On Mon, May 10, 2021 at 13:55:18 +0200, Mauro Carvalho Chehab wrote:
> $ git grep "CPU 0 has been" Documentation/RCU/
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| #. CPU 0 has been in dyntick-idle mode for quite some time. When it |
> Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| notices that CPU 0 has been in dyntick idle mode, which qualifies |

The kernel documentation uses hard line wraps, so such a naive grep is
always going to fail unless the line wraps are taken into account. Not
saying this isn't an improvement in and of itself, but smarter
searching strategies are likely needed anyway.

--Ben
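Ben's wrapped-line caveat is easy to demonstrate: joining hard-wrapped lines before matching makes the phrase findable again. A minimal sketch (a hypothetical helper, not a drop-in replacement for git grep; the `|` table borders in the RST output above would still need extra stripping):

```python
import re

def grep_unwrapped(phrase: str, text: str) -> bool:
    """Match a phrase even when it is split across hard-wrapped lines,
    by collapsing every run of whitespace (including newlines) into a
    single space before searching."""
    joined = re.sub(r"\s+", " ", text)
    return phrase in joined
```

A naive substring search on the raw file fails for the same input, which is exactly the failure mode shown in the `git grep` output above.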
On 10/05/2021 14:59, Matthew Wilcox wrote:
> Most of these UTF-8 characters come from latex conversions and really
> aren't necessary (and are being used incorrectly).

I fully agree with fixing those. The cover-letter, however, gave the
impression that that was not the main purpose of this series; just,
perhaps, a happy side-effect.

> You seem quite knowledgeable about the various differences. Perhaps
> you'd be willing to write a document for Documentation/doc-guide/
> that provides guidance for when to use which kinds of horizontal
> line?

I have Opinions about the proper usage of punctuation, but I also know
that other people have differing opinions. For instance, I place
spaces around an em dash, which is nonstandard according to most style
guides. Really this is an individual enough thing that I'm not sure we
could have a "kernel style guide" that would be more useful than
general-purpose guidance like the page you linked.

Moreover, such a guide could make non-native speakers needlessly
self-conscious about their writing and discourage them from
contributing documentation at all. I'm not advocating here for trying
to push kernel developers towards an eats-shoots-and-leaves level of
linguistic pedantry; rather, I merely think that existing correct
usages should be left intact (and therefore, excising incorrect usage
should only be attempted by someone with both the expertise and time
to check each case).

But if you really want such a doc I wouldn't mind contributing to it.

-ed
On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8.
> >
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> >
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
>
> Replacing those makes sense. But replacing em dashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.

I regularly use --- for em-dashes and -- for en-dashes. Markdown will
automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
hyphens to en-dashes. It's much, much easier for me to type 2 or 3
hyphens into my text editor of choice than trying to enter the UTF-8
characters. If we can make sphinx do this translation, maybe that's
the best way of dealing with these two characters?

Cheers,

- Ted
On Mon, May 10, 2021 at 12:26:12PM +0200, Mauro Carvalho Chehab wrote:
> There are several UTF-8 characters at the Kernel's documentation.
[...]
> Other UTF-8 characters were added along the time, but they're easily
> replaceable by ASCII chars.
>
> As Linux developers are all around the globe, and not everybody has UTF-8
> as their default charset

I'm not aware of a distribution that still allows selecting a
non-UTF-8 charset in a normal flow in their installer. And if they
haven't purged support for ancient encodings, that support is
thoroughly bitrotten. Thus, I disagree that this is a legitimate
concern.

What _could_ be a legitimate reason is that someone is on a _terminal_
that can't display a wide enough set of glyphs. Such terminals are:

• Linux console (because of vgacon limitations; patchsets to improve
  other cons haven't been mainlined)
• some Windows terminals (putty, old Windows console) that can't
  borrow glyphs from other fonts like fontconfig can

For the former, it's whatever your distribution ships in
/usr/share/consolefonts/ or an equivalent, which is based on historic
ISO-8859 and VT100 traditions. For the latter, the near-guaranteed
character set is WGL4.

Thus, at least two of your choices seem to disagree with the above:

[dropped]
> 0xd7 => 'x',	# MULTIPLICATION SIGN
[retained]
> - U+2b0d ('⬍'): UP DOWN BLACK ARROW

× is present in ISO-8859, VT100, WGL4; I've found no font in
/usr/share/consolefonts/ on my Debian unstable box that lacks this
character. ⬍ is not found in any of the above. You might want to at
least convert it to ↕, which is at least present in WGL4 and thus
likely to be supported in fonts heeding Windows/Mac/OpenType
recommendations. That still won't make it work on VT.

Meow!
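Since the thread keeps coming back to judging each character individually, an audit pass over a document can be sketched with the stdlib `unicodedata` module. This is a hypothetical helper for reviewing occurrences one by one, not a tool from the series:

```python
import unicodedata

def report_non_ascii(text):
    """Yield (offset, codepoint, name) for each non-ASCII character,
    so em dashes, NBSPs, multiplication signs, etc. can be reviewed
    case by case instead of being blindly converted."""
    for offset, ch in enumerate(text):
        if ord(ch) > 0x7F:
            yield offset, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")
```

Running it over a file flags exactly the characters debated above (EM DASH, NO-BREAK SPACE, MULTIPLICATION SIGN, ...) with their Unicode names, which makes the semantic-vs-typographic call easier.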
Em Mon, 10 May 2021 15:33:47 +0100
Edward Cree <ecree.xilinx@gmail.com> escreveu:

> On 10/05/2021 14:59, Matthew Wilcox wrote:
> > Most of these UTF-8 characters come from latex conversions and
> > really aren't necessary (and are being used incorrectly).
> I fully agree with fixing those.
> The cover-letter, however, gave the impression that that was not the
> main purpose of this series; just, perhaps, a happy side-effect.

Sorry for the mess. The main reason why I wrote this series is that
there are lots of UTF-8 left-over chars from the ReST conversion. See:

- https://lore.kernel.org/linux-doc/20210507100435.3095f924@coco.lan/

A large set of the UTF-8 left-over chars were due to my conversion
work, so I feel personally responsible to fix those ;-)

Yet, this series has two positive side effects:

- it helps people needing to touch the documents using non-utf8
  locales[1];
- it makes it easier to grep for a text;

[1] There are still some widely used distros nowadays (LTS ones?) that
    don't set UTF-8 as default. Last time I installed a Debian machine
    I had to explicitly set the UTF-8 charset after install, as the
    default was ASCII encoding (can't remember if it was Debian 10 or
    an older version).

Unintentionally, I ended up giving emphasis to the non-utf8 issue
instead of giving emphasis to the conversion left-overs.

FYI, this patch series originated from a discussion at linux-doc,
reporting that Sphinx breaks when LANG is not set to utf-8[2]. That's
probably why I ended up giving the wrong emphasis in the cover letter.

[2] See https://lore.kernel.org/linux-doc/20210506103913.GE6564@kitsune.suse.cz/
    for the original report. I strongly suspect that the VM set up by
    Michal to build the docs was using a distro that doesn't set UTF-8
    as default.

PS.: I intend to prepare afterwards a separate fix to avoid the Sphinx
logger crashing during Kernel doc builds when the locale charset is
not UTF-8, but I'm not too fluent in python. So, I need some time to
check whether there is a way to avoid the python log crashes without
touching Sphinx code and without needing to trick it into thinking
that the machine's locale is UTF-8.

See: while there was just a single document originally stored in the
Kernel tree as a LaTeX document at the time we did the conversion
(cdrom-standard.tex), there are several other documents stored as text
that seem to have been generated by some tool like LaTeX, whose
original versions were not preserved. Also, there were other documents
using different markdown dialects that were converted via pandoc
(and/or other similar tools). That's not to mention the ones that were
converted from DocBook. Such tools tend to use some logic to produce
"neat" versions of some ASCII characters, like what this tool does:

	https://daringfireball.net/projects/smartypants/

(Sphinx itself seemed to use this tool in its early versions.)

All tool-converted documents can carry UTF-8 in unexpected places.
See, in this series, a large amount of patches deal with U+A0
(NO-BREAK SPACE) chars. I can't see why someone writing a plain text
document (or a ReST one) would type a NO-BREAK SPACE instead of a
normal white space.

The same applies, up to a point, to curly quotes: usually people just
write ASCII quotes in their documents, and use some tool like LaTeX or
a text editor like libreoffice in order to convert them into “utf-8
curly quotes”[3].

[3] Sphinx will do such things at the produced output, doing something
    similar to what smartypants does, nowadays using this:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

    E.g.:

    - Straight quotes (" and ') turned into "curly" quote characters;
    - dashes (-- and ---) turned into en- and em-dash entities;
    - three consecutive dots (... or . . .) turned into an ellipsis char.

> > You seem quite knowledgeable about the various differences. Perhaps
> > you'd be willing to write a document for Documentation/doc-guide/
> > that provides guidance for when to use which kinds of horizontal
> > line?
> I have Opinions about the proper usage of punctuation, but I also know
> that other people have differing opinions. For instance, I place
> spaces around an em dash, which is nonstandard according to most
> style guides. Really this is an individual enough thing that I'm not
> sure we could have a "kernel style guide" that would be more useful
> than general-purpose guidance like the page you linked.
> Moreover, such a guide could make non-native speakers needlessly
> self-conscious about their writing and discourage them from
> contributing documentation at all.

I don't think so. As a matter of fact, as a non-native speaker, I
guess this can actually help people willing to write documents.

> I'm not advocating here for trying to push
> kernel developers towards an eats-shoots-and-leaves level of
> linguistic pedantry; rather, I merely think that existing correct
> usages should be left intact (and therefore, excising incorrect usage
> should only be attempted by someone with both the expertise and time
> to check each case).
>
> But if you really want such a doc I wouldn't mind contributing to it.

IMO, a document like that can be helpful. I can help reviewing it.

Thanks,
Mauro
On Tue, 2021-05-11 at 11:00 +0200, Mauro Carvalho Chehab wrote:
> Yet, this series has two positive side effects:
>
> - it helps people needing to touch the documents using non-utf8 locales[1];
> - it makes easier to grep for a text;
>
> [1] There are still some widely used distros nowadays (LTS ones?) that
>     don't set UTF-8 as default. Last time I installed a Debian machine
>     I had to explicitly set UTF-8 charset after install as the default
>     were using ASCII encoding (can't remember if it was Debian 10 or an
>     older version).

This whole line of thinking is fundamentally wrong. A given set of
characters in a "text file" is encoded with a specific character set /
encoding. To interpret that file and convert the bytes back to
characters, we need to use the *same* charset. That charset is a
property of the text file, and each text file or piece of text in a
system (like this email, which will contain a Content-Type: header
indicating the charset) might be encoded with a *different* character
set.

In the days before you could connect computers together — or before
you could exchange data between computers in different countries, at
least — perhaps it made sense to store 'text' files without explicitly
noting their encoding, and to interpret them using some kind of
"default" character set. Those days are long gone.

You're trying to work around an egregiously stupid bug, if you're
trying to pander to "default" encodings. There *is* no default
encoding that even makes sense, except perhaps UTF-8. To *speak* of
them as you did shows a misunderstanding of how broken they are. It's
*precisely* that kind of half-baked thinking which always used to lead
to stupid assumptions and double conversions and mojibake, before we
just standardised on UTF-8 everywhere and it stopped mattering so
much. Just don't.

Now, you *can* make this work if you really insist on it, even for
systems with EBCDIC as their default encoding. Just make git do the
"convert to local charset" on checkout, precisely the same way as it
does CRLF for Windows systems. But it's stupid and anachronistic, so I
don't really see the point.
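The checkout-time conversion David alludes to does exist in git as the `working-tree-encoding` attribute (Git 2.18+): git stores the blob re-encoded to UTF-8 and converts it to the declared charset on checkout, much like its CRLF handling. A hypothetical example (the path pattern and the encoding are placeholders, not a recommendation):

```
# .gitattributes: check these files out in a legacy charset,
# while storing them internally as UTF-8
Documentation/example/*.rst	text working-tree-encoding=ISO-8859-1
```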
Em Mon, 10 May 2021 14:49:44 +0100
David Woodhouse <dwmw2@infradead.org> escreveu:

> On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > This patch series is doing conversion only when using ASCII makes
> > more sense than using UTF-8.
> >
> > See, a number of converted documents ended with weird characters
> > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > character doesn't do any good.
> >
> > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > someone tries to use grep[1].
>
> Replacing those makes sense. But replacing emdashes — which are a
> distinct character that has no direct replacement in ASCII and which
> people do *deliberately* use instead of hyphen-minus — does not.
>
> Perhaps stick to those two, and any cases where an emdash or endash has
> been used where U+002D HYPHEN-MINUS *should* have been used.

Ok. I'll rework the series excluding EM/EN DASH chars from it.

I'll then apply the changes for EM/EN DASH chars manually (probably on
a separate series) where it seems to fit. That should make it easier
to discuss such replacements.

> And please fix your cover letter which made no reference to 'grep', and
> only presented a completely bogus argument for the change instead.

OK!

Regards,
Mauro
Em Mon, 10 May 2021 15:22:02 -0400
"Theodore Ts'o" <tytso@mit.edu> escreveu:

> On Mon, May 10, 2021 at 02:49:44PM +0100, David Woodhouse wrote:
> > On Mon, 2021-05-10 at 13:55 +0200, Mauro Carvalho Chehab wrote:
> > > This patch series is doing conversion only when using ASCII makes
> > > more sense than using UTF-8.
> > >
> > > See, a number of converted documents ended with weird characters
> > > like ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
> > > character doesn't do any good.
> > >
> > > Others use NO-BREAK SPACE (U+A0) instead of 0x20. Harmless, until
> > > someone tries to use grep[1].
> >
> > Replacing those makes sense. But replacing emdashes — which are a
> > distinct character that has no direct replacement in ASCII and which
> > people do *deliberately* use instead of hyphen-minus — does not.
>
> I regularly use --- for em-dashes and -- for en-dashes. Markdown will
> automatically translate 3 ASCII hyphens to em-dashes, and 2 ASCII
> hyphens to en-dashes. It's much, much easier for me to type 2 or 3
> hyphens into my text editor of choice than trying to enter the UTF-8
> characters.

Yeah, typing those UTF-8 chars is a lot harder than typing -- and ---
in several text editors ;-)

Here, I only type UTF-8 chars for accents (my US-layout keyboards are
all set to US international, so typing those is easy).

> If we can make sphinx do this translation, maybe that's
> the best way of dealing with these two characters?

Sphinx already does that by default[1], using smartquotes:

	https://docutils.sourceforge.io/docs/user/smartquotes.html

Those are the conversions that are done there:

- Straight quotes (" and ') turned into "curly" quote characters;
- dashes (-- and ---) turned into en- and em-dash entities;
- three consecutive dots (... or . . .) turned into an ellipsis char.

So, we can simply write plain ASCII quotes, hyphens, and dots in the
sources, and let Sphinx produce the curly quotes, dashes, and
ellipses in the output.

[1] There's a way to disable it at conf.py, but at the Kernel this is
    kept at its default: to automatically do such conversions.

Thanks,
Mauro
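The three transliterations Mauro lists can be sketched to show the round trip with the ASCII spellings discussed earlier. This is a toy approximation, not docutils' actual smartquotes code (the real pass is context-sensitive, e.g. it only curls quotes adjacent to words, so quote handling is left out here):

```python
import re

def smart_punctuation(text: str) -> str:
    """Toy version of the smartquotes dash/ellipsis rules.
    Longest run first, so '---' is consumed before '--'."""
    text = text.replace("---", "\u2014")               # em dash
    text = text.replace("--", "\u2013")                # en dash
    text = re.sub(r"\.\.\.|\. \. \.", "\u2026", text)  # ellipsis
    return text
```

In Sphinx this behaviour is controlled by the `smartquotes` setting in conf.py, which defaults to enabled, matching Mauro's footnote [1].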