Re: [math-fun] Non-ASCII, Non-Baudot
Dan Asimov <dasimov@earthlink.net> wrote:
I'm not sure which characters on my Mac count as ASCII. What about accented Latin, like ?, ?, ?, ?, etc. ???
ASCII contains the 26 letters of the Latin alphabet, both upper and lowercase, the ten decimal digits, the space character, and the following other glyphs: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ It also contains non-printing control codes, such as return (enter), linefeed (newline, line break), and tab.

Unfortunately it does not contain accented letters, Greek letters, Hebrew letters, "smart" quotes (i.e. distinct open and closed quotes), not-equal, less-than-or-equal, greater-than-or-equal, not-less-than, not-greater-than, integral sign, intersection, union, square-root, superscript, subscript, em-dash, null, cents-sign, euro-sign, pounds-sterling-sign, modulo, times-sign, or division-sign. That's probably why math papers are typically written with something like TeX rather than in plain ASCII.
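For the curious, here is a minimal Python sketch (an editorial illustration, not part of any standard tooling) that prints the entire printable ASCII repertoire, codes 32 through 126:

    # Print every printable ASCII character: codes 32 (space) through 126 (~).
    # Codes 0-31 and 127 are the non-printing control codes (tab, linefeed, etc.).
    printable = ''.join(chr(c) for c in range(32, 127))
    print(printable)
    # Anything not in this output -- accented letters, Greek, "smart" quotes,
    # the integral sign -- is simply not ASCII.

Eugene Salamin <gene_salamin@yahoo.com> wrote: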
ASCII consists of the 7-bit codes, 0-127. In the days of the teletype, the 8th bit was used for parity.
Right. But not helpful to people who haven't memorized the codes, hence have no idea whether the character they want to use has the eighth bit turned on or not. Indeed, if that character isn't in ASCII, which bits that character has depends on which code they are using instead of ASCII.

rwg <rwg@sdf.org> wrote:
Before that, wasn't there a 5-bit Baudot code?
The real Baudot code hasn't been used in more than a century. There's another five-bit code commonly but incorrectly called Baudot, which is still in use by radio amateurs, deaf people, and possibly by newsrooms. It was seldom if ever used by computers. It has even fewer characters: it doesn't contain lowercase letters, and it only has room for the decimal digits because there are shift characters, "FIGS" and "LTRS."

As it was intended for mechanical teletypewriters, the commonest letters were given the fewest 1-bits, to minimize wear. This makes bitwise palindromes interesting, unlike in ASCII. (They're also interesting in Morse code. "Footstool" is the longest single-word Morse palindrome.) But it means the letters aren't in any sort of binary order: adding 1 to the code for "A" won't get you "B."

An eight-bit character code used on IBM mainframes is EBCDIC. It descends from older punched-card codes and is not related to ASCII. I don't know if it's still in use anywhere. Fortunately, it almost never appears on the net. Nor do various other pre-ASCII codes. The oldest known binary character encoding -- another five-bit code -- was invented by Francis Bacon more than four centuries ago for cryptographic and steganographic purposes.
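Regarding the Morse palindrome claim above, here is a minimal Python sketch of the check; the claim that "footstool" is the longest such single word is taken from the message, the code is just an illustration:

    # A word is a Morse "palindrome" if, spelled out in dots and dashes with
    # letter boundaries ignored, the string reads the same backwards.
    MORSE = {
        'a': '.-',   'b': '-...', 'c': '-.-.', 'd': '-..',  'e': '.',
        'f': '..-.', 'g': '--.',  'h': '....', 'i': '..',   'j': '.---',
        'k': '-.-',  'l': '.-..', 'm': '--',   'n': '-.',   'o': '---',
        'p': '.--.', 'q': '--.-', 'r': '.-.',  's': '...',  't': '-',
        'u': '..-',  'v': '...-', 'w': '.--',  'x': '-..-', 'y': '-.--',
        'z': '--..',
    }

    def is_morse_palindrome(word):
        code = ''.join(MORSE[c] for c in word.lower())
        return code == code[::-1]

    print(is_morse_palindrome('footstool'))  # True
    print(is_morse_palindrome('ascii'))      # False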
It is ridiculous and inexcusable for the digest to destroy, rather than transmit, characters that it doesn't recognize. Fix the digest.
Are you volunteering? It's not obvious what the right thing to do is. Simply transmitting the raw bit patterns is no better than replacing them with question marks, since without the headers saying which encoding is in use, gibberish will probably result.

One approach would be to allow only one non-ASCII encoding per digest. If, say, a message in ISO-8859-1 comes along, other messages with that encoding will be allowed in the same digest, but any UTF-8 message will be held until the next digest, which will be all UTF-8. (Since these are all supersets of ASCII, plain ASCII messages will be allowed in all digests.) The header of each digest will say which, if any, non-ASCII encoding is in use.

Another approach would be to translate the encodings. But what if the source encoding contains characters that can't be represented in the destination encoding? Also, any errors are likely to not only propagate, but to expand with each re-quoting. Here's a real example of that from the S.M. Stirling email list:

    Michael we couldn=C3=83=C2=83=C3=82=C2=83=C3=83=C2=82=C3=82=
    C2=A2=C3=83=C2=83=C3=82=C2=A2=C3=83=C2=82=C3=82=C2=82=C3=83=
    C2=82=C3=82=C2=AC=C3=83=C2=83=C3=82=C2=A2=C3=83=C2=82=C3=82=
    C2=84=C3=83=C2=82=C3=82=C2=A2t maintain the 95 or so divisions

All that gibberish was intended to be a single "smart" apostrophe. I prefer "dumb" ASCII apostrophes, since they're less likely to go bad.
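A minimal Python sketch of how that kind of expansion can happen. It assumes UTF-8 bytes being repeatedly misread as Windows-1252 and re-encoded, which is one plausible reconstruction, not necessarily the exact chain that mangled the quoted message:

    # Start with a single "smart" apostrophe (U+2019).
    s = "\u2019"
    for round in range(3):
        # Encode correctly as UTF-8, then wrongly decode those bytes as
        # Windows-1252.  Each round roughly triples the byte count.
        s = s.encode("utf-8").decode("cp1252")
        print(round + 1, len(s.encode("utf-8")), repr(s))
    # After a few rounds, one apostrophe has become a long run of accented
    # capital A's and other junk, much like the quoted example.

Dave Dyer <ddyer@real-me.net> wrote: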
I don't think it's fair to blame the digesting process. GIGO applies. There are (mostly ignored) standards for encoding non-ASCII content such that at least in principle it can be correctly interpreted.
I agree. The problem is when multiple character codes appear in a single digest.
Even if the unusual character codes made it through the program chain intact, the endpoint that displays them would have no clue what to do with them. In fact, I'll give even odds that that is exactly what's happening: the digest contains an unusual ASCII code, and the program being used to display the message shows you a ?.
The digest, as sent, contains the literal question marks. My display program shows me exactly what's there. It doesn't "helpfully" interpret anything for me. So it's the digestifying program which is replacing non-ASCII content (not "unusual ASCII codes," whatever that's supposed to mean) with question marks. This isn't perfect, of course, but it's not an unreasonable compromise.

The simplest approach, assuming senders are unwilling or unable to stop sending non-ASCII, would be to shut down the digest and force everyone to subscribe to one message at a time. I hope that doesn't happen. I prefer the digest, so as to mitigate header bloat. I like being able to save all messages on all lists forever onto a 4-gig thumb drive in my pocket. (I've been using email for a quarter of the time since the Civil War, and I save everything.)

It's bad enough that the message *bodies* are bloated, with most people inexplicably quoting *every* *single* *word* of whatever they're replying to, even if it had been sent just minutes earlier, and even if it contains every single word of whatever it's in reply to. I won't mention the fact that this quoting is invariably in reverse order, since that doesn't affect the message size, merely the readability. (The purpose of quoting is to establish context for your reply. Quoting *everything* is like soaking a textbook in highlighter fluid.)

Also, lots of people send the body of their message -- including quoted text -- twice in each message, the second time in bloated HTML encoding, as if email were a web page. They generally don't know that they're doing it. They're using mail software that should be taken out back and shot. For one thing, such messages, unlike plain text, can convey malware. The digestifying process is presumably either rejecting all such messages or stripping the HTML off -- and that's a good thing.
UTF-8 has become the de facto standard for the web, like it or not.
From the Wikipedia page:
"UTF-8 has become the dominant character encoding for the World Wide Web, accounting for 83.3% of all Web pages in March 2015 (with most popular East Asian encoding, GB 2312, at 1.3%).[2][3][4] The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.[5] The W3C recommends UTF-8 as default encoding in their main standards (XML and HTML)." On Tue, May 5, 2015 at 4:32 AM, Keith F. Lynch <kfl@keithlynch.net> wrote:
The simplest approach, assuming senders are unwilling or unable to stop sending non-ASCII, would be to shut down the digest and force everyone to subscribe to one message at a time.
Well, the simplest thing is to do nothing and those who care more about getting the original characters than about receiving a digest will switch.
I hope that doesn't happen. I prefer the digest, so as to mitigate header bloat. I like being able to save all messages on all lists forever onto a 4-gig thumb drive in my pocket.
Wouldn't gzip get rid of most of the overhead?
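A minimal sketch of that point, using Python's gzip module on some artificially repetitive digest-like text; the sample text and the resulting numbers are illustrative, not measurements of the actual math-fun digest:

    import gzip

    # Fake a digest where the same message is quoted over and over,
    # headers and all -- exactly the kind of redundancy gzip removes well.
    message = ("From: someone@example.com\n"
               "Subject: Re: Non-ASCII, Non-Baudot\n"
               "> " + "All work and no play makes Jack a dull boy. " * 20 + "\n")
    digest = (message * 50).encode("ascii")

    compressed = gzip.compress(digest)
    print(len(digest), "bytes raw")
    print(len(compressed), "bytes gzipped")

--
Mike Stay - metaweta@gmail.com
http://www.cs.auckland.ac.nz/~mike
http://reperiendi.wordpress.com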
On 05/05/2015 04:32, Keith F. Lynch wrote:

[someone else -- Bill Gosper, I think:]
Fix the digest.
[Keith:]
Are you volunteering? It's not obvious what the right thing to do is. Simply transmitting the raw bit patterns is no better than replacing them with question marks, since without the headers saying which encoding is in use, gibberish will probably result. ... Another approach would be to translate the encodings. But what if the source encoding contains characters that can't be represented in the destination encoding?
If the destination encoding is something that represents the full Unicode repertoire (UTF-8 would be the obvious candidate) then that is not likely to happen. The actual difficulty here is that in some cases it will be hard to tell what the source encoding actually was.
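A minimal Python sketch of that translation step, assuming we only have a guess at the source encoding; the function name and the candidate list are illustrative, not an existing library API:

    def to_utf8(raw_bytes, declared=None):
        # Try the declared charset first, then a few common fallbacks.
        # Guessing the source is the hard part; the UTF-8 side can represent
        # the full Unicode repertoire, so nothing is lost on output.
        for enc in (declared, "utf-8", "cp1252", "iso-8859-1"):
            if not enc:
                continue
            try:
                return raw_bytes.decode(enc).encode("utf-8")
            except UnicodeDecodeError:
                continue  # ISO-8859-1 never fails, so we always return above.

    # ISO-8859-1 bytes for "naive" with an i-diaeresis come out as valid UTF-8.
    print(to_utf8(b"na\xefve", declared="iso-8859-1"))  # b'na\xc3\xafve'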
The simplest approach, assuming senders are unwilling or unable to stop sending non-ASCII, would be to shut down the digest and force everyone to subscribe to one message at a time.
That seems less simple and obviously worse than just keeping things as they are. (I do not claim that keeping things as they are is the best option.)
I won't mention the fact that this quoting is invariably in reverse order, since that doesn't affect the message size, merely the readability.
The quoting isn't invariably in reverse order; for instance, in this very discussion both you and I have quoted other people without reversing the order.
Also, lots of people send the body of their message -- including quoted text -- twice in each message, the second time in bloated HTML encoding, as if email were a web page.
No: as if HTML is able to represent simple typographical markup like italics and boldface and colour, and as if some people want those things in their email messages as they do in their web pages, printed documents, handwritten letters, etc.

If you receive 100 email messages a day with 10kB of bloat each, then every year the bloat costs you about 365MB of disc space (ignoring the possibility of compression -- redundant bloaty stuff compresses well). If you store them on a spinning-rust hard drive at a few cents per gigabyte, that will cost you about a cent or two per year. If you store them on a fancy high-end SSD, it might be more like 15 cents per year. Maybe 30 cents per year, if you want the absolute fastest gold-plated random access to your email archive.

You may well have a bigger volume of email than that, but if it's *much* bigger then I betcha the biggest contributors to its volume are not messages that could have been plain text but have been sent as HTML, but (1) messages that could have been plain text but have been sent as (say) Microsoft Word documents, and (2) messages with large attachments -- e.g., photographs or (hopefully non-malicious) computer software. These may be a blessing or a curse or both, but they have nothing to do with the merits of the math-fun digest.
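A back-of-the-envelope sketch of that arithmetic in Python; the per-gigabyte prices are rough 2015-era assumptions, not measurements:

    # 100 messages/day, 10 kB of HTML bloat each.
    messages_per_day = 100
    bloat_per_message_kb = 10

    bloat_per_year_gb = messages_per_day * bloat_per_message_kb * 365 / 1e6
    print(round(bloat_per_year_gb, 3), "GB per year")   # ~0.365 GB

    # Assumed storage prices, dollars per GB (circa 2015).
    hdd_per_gb, ssd_per_gb = 0.04, 0.40

    print(round(bloat_per_year_gb * hdd_per_gb * 100, 1), "cents/yr on spinning rust")
    print(round(bloat_per_year_gb * ssd_per_gb * 100, 1), "cents/yr on a high-end SSD")

-- g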