Re: [math-fun] Non-ASCII, Non-Baudot
Dan Asimov <dasimov@earthlink.net> wrote:
I'm not sure which characters on my Mac count as ASCII. What about accented Latin, like ?, ?, ?, ?, etc. ???
ASCII contains the 26 letters of the Latin alphabet, both upper and lowercase, the ten decimal digits, the space character, and the following other glyphs: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ It also contains non-printing control codes, such as return (enter), linefeed (newline, line break), and tab.

Unfortunately it does not contain accented letters, Greek letters, Hebrew letters, "smart" quotes (i.e. distinct open and closed quotes), not-equal, less-than-or-equal, greater-than-or-equal, not-less-than, not-greater-than, integral sign, intersection, union, square-root, superscript, subscript, em-dash, null, cents-sign, euro-sign, pounds-sterling-sign, modulo, times-sign, or division-sign. That's probably why math papers are typically written with something like TeX rather than in plain ASCII.
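For the curious, here is a minimal Python sketch (an editorial illustration, not part of any standard tooling) that prints the entire printable ASCII repertoire, codes 32 through 126:

    # Print every printable ASCII character: codes 32 (space) through 126 (~).
    # Codes 0-31 and 127 are the non-printing control codes (tab, linefeed, etc.).
    printable = ''.join(chr(c) for c in range(32, 127))
    print(printable)
    # Anything not in this output -- accented letters, Greek, "smart" quotes,
    # the integral sign -- is simply not ASCII.

Eugene Salamin <gene_salamin@yahoo.com> wrote: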
ASCII consists of the 7-bit codes, 0-127. In the days of the teletype, the 8th bit was used for parity.
Right. But not helpful to people who haven't memorized the codes, hence have no idea whether the character they want to use has the eighth bit turned on or not. Indeed, if that character isn't in ASCII, which bits that character has depends on which code they are using instead of ASCII.

rwg <rwg@sdf.org> wrote:
Before that, wasn't there a 5-bit Baudot code?
The real Baudot code hasn't been used in more than a century. There's another five-bit code commonly but incorrectly called Baudot, which is still in use by radio amateurs, deaf people, and possibly by newsrooms. It was seldom if ever used by computers. It has even fewer characters: it doesn't contain lowercase letters, and it only has room for the decimal digits because there are shift characters, "FIGS" and "LTRS."

As it was intended for mechanical teletypewriters, the commonest letters were given the fewest 1-bits, to minimize wear. This makes bitwise palindromes interesting, unlike in ASCII. (They're also interesting in Morse code. "Footstool" is the longest single-word Morse palindrome.) But it means the letters aren't in any sort of binary order: adding 1 to the code for "A" won't get you "B."

An eight-bit character code used on IBM mainframes is EBCDIC. It descends from older punched-card codes and is not related to ASCII. I don't know if it's still in use anywhere. Fortunately, it almost never appears on the net. Nor do various other pre-ASCII codes. The oldest known binary character encoding -- another five-bit code -- was invented by Francis Bacon more than four centuries ago for cryptographic and steganographic purposes.
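Regarding the Morse palindrome claim above, here is a minimal Python sketch of the check; the claim that "footstool" is the longest such single word is taken from the message, the code is just an illustration:

    # A word is a Morse "palindrome" if, spelled out in dots and dashes with
    # letter boundaries ignored, the string reads the same backwards.
    MORSE = {
        'a': '.-',   'b': '-...', 'c': '-.-.', 'd': '-..',  'e': '.',
        'f': '..-.', 'g': '--.',  'h': '....', 'i': '..',   'j': '.---',
        'k': '-.-',  'l': '.-..', 'm': '--',   'n': '-.',   'o': '---',
        'p': '.--.', 'q': '--.-', 'r': '.-.',  's': '...',  't': '-',
        'u': '..-',  'v': '...-', 'w': '.--',  'x': '-..-', 'y': '-.--',
        'z': '--..',
    }

    def is_morse_palindrome(word):
        code = ''.join(MORSE[c] for c in word.lower())
        return code == code[::-1]

    print(is_morse_palindrome('footstool'))  # True
    print(is_morse_palindrome('ascii'))      # False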
It is ridiculous and inexcusable for the digest to destroy, rather than transmit, characters that it doesn't recognize. Fix the digest.
Are you volunteering? It's not obvious what the right thing to do is. Simply transmitting the raw bit patterns is no better than replacing them with question marks, since without the headers saying which encoding is in use, gibberish will probably result.

One approach would be to allow only one non-ASCII encoding per digest. If, say, a message in ISO-8859-1 comes along, other messages with that encoding will be allowed in the same digest, but any UTF-8 message will be held until the next digest, which will be all UTF-8. (Since these are all supersets of ASCII, plain ASCII messages will be allowed in all digests.) The header of each digest will say which, if any, non-ASCII encoding is in use.

Another approach would be to translate the encodings. But what if the source encoding contains characters that can't be represented in the destination encoding? Also, any errors are likely to not only propagate, but to expand with each re-quoting. Here's a real example of that from the S.M. Stirling email list:

    Michael we couldn=C3=83=C2=83=C3=82=C2=83=C3=83=C2=82=C3=82=
    C2=A2=C3=83=C2=83=C3=82=C2=A2=C3=83=C2=82=C3=82=C2=82=C3=83=
    C2=82=C3=82=C2=AC=C3=83=C2=83=C3=82=C2=A2=C3=83=C2=82=C3=82=
    C2=84=C3=83=C2=82=C3=82=C2=A2t maintain the 95 or so divisions

All that gibberish was intended to be a single "smart" apostrophe. I prefer "dumb" ASCII apostrophes, since they're less likely to go bad.
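A minimal Python sketch of how that kind of expansion can happen. It assumes UTF-8 bytes being repeatedly misread as Windows-1252 and re-encoded, which is one plausible reconstruction, not necessarily the exact chain that mangled the quoted message:

    # Start with a single "smart" apostrophe (U+2019).
    s = "\u2019"
    for round in range(3):
        # Encode correctly as UTF-8, then wrongly decode those bytes as
        # Windows-1252.  Each round roughly triples the byte count.
        s = s.encode("utf-8").decode("cp1252")
        print(round + 1, len(s.encode("utf-8")), repr(s))
    # After a few rounds, one apostrophe has become a long run of accented
    # capital A's and other junk, much like the quoted example.

Dave Dyer <ddyer@real-me.net> wrote: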
I don't think it's fair to blame the digesting process. GIGO applies. There are (mostly ignored) standards for encoding non-ASCII content such that at least in principle it can be correctly interpreted.
I agree. The problem is when multiple character codes appear in a single digest.
Even if the unusual character codes made it through the program chain intact, the endpoint that displays them would have no clue what to do with them. In fact, I'll give even odds that that is exactly what's happening: the digest contains an unusual ASCII code, and the program being used to display the message shows you a ?.
The digest, as sent, contains the literal question marks. My display program shows me exactly what's there. It doesn't "helpfully" interpret anything for me. So it's the digestifying program which is replacing non-ASCII content (not "unusual ASCII codes," whatever that's supposed to mean) with question marks. This isn't perfect, of course, but it's not an unreasonable compromise.

The simplest approach, assuming senders are unwilling or unable to stop sending non-ASCII, would be to shut down the digest and force everyone to subscribe to one message at a time. I hope that doesn't happen. I prefer the digest, so as to mitigate header bloat. I like being able to save all messages on all lists forever onto a 4-gig thumb drive in my pocket. (I've been using email for a quarter of the time since the Civil War, and I save everything.)

It's bad enough that the message *bodies* are bloated, with most people inexplicably quoting *every* *single* *word* of whatever they're replying to, even if it had been sent just minutes earlier, and even if it contains every single word of whatever it's in reply to. I won't mention the fact that this quoting is invariably in reverse order, since that doesn't affect the message size, merely the readability. (The purpose of quoting is to establish context for your reply. Quoting *everything* is like soaking a textbook in highlighter fluid.)

Also, lots of people send the body of their message -- including quoted text -- twice in each message, the second time in bloated HTML encoding, as if email were a web page. They generally don't know that they're doing it. They're using mail software that should be taken out back and shot. For one thing, such messages, unlike plain text, can convey malware. The digestifying process is presumably either rejecting all such messages or stripping the HTML off -- and that's a good thing.
UTF-8 has become the de facto standard for the web, like it or not.
From the Wikipedia page:
"UTF-8 has become the dominant character encoding for the World Wide Web, accounting for 83.3% of all Web pages in March 2015 (with most popular East Asian encoding, GB 2312, at 1.3%).[2][3][4] The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8.[5] The W3C recommends UTF-8 as default encoding in their main standards (XML and HTML)." On Tue, May 5, 2015 at 4:32 AM, Keith F. Lynch <kfl@keithlynch.net> wrote:
The simplest approach, assuming senders are unwilling or unable to stop sending non-ASCII, would be to shut down the digest and force everyone to subscribe to one message at a time.
Well, the simplest thing is to do nothing and those who care more about getting the original characters than about receiving a digest will switch.
I hope that doesn't happen. I prefer the digest, so as to mitigate header bloat. I like being able to save all messages on all lists forever onto a 4-gig thumb drive in my pocket.
Wouldn't gzip get rid of most of the overhead?
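A minimal sketch of that point, using Python's gzip module on some artificially repetitive digest-like text; the sample text and the resulting numbers are illustrative, not measurements of the actual math-fun digest:

    import gzip

    # Fake a digest where the same message is quoted over and over,
    # headers and all -- exactly the kind of redundancy gzip removes well.
    message = ("From: someone@example.com\n"
               "Subject: Re: Non-ASCII, Non-Baudot\n"
               "> " + "All work and no play makes Jack a dull boy. " * 20 + "\n")
    digest = (message * 50).encode("ascii")

    compressed = gzip.compress(digest)
    print(len(digest), "bytes raw")
    print(len(compressed), "bytes gzipped")

--
Mike Stay - metaweta@gmail.com
http://www.cs.auckland.ac.nz/~mike
http://reperiendi.wordpress.com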
On 05/05/2015 04:32, Keith F. Lynch wrote:

[someone else -- Bill Gosper, I think:]
Fix the digest.
[Keith:]
Are you volunteering? It's not obvious what the right thing to do is. Simply transmitting the raw bit patterns is no better than replacing them with question marks, since without the headers saying which encoding is in use, gibberish will probably result. ... Another approach would be to translate the encodings. But what if the source encoding contains characters that can't be represented in the destination encoding?
If the destination encoding is something that represents the full Unicode repertoire (UTF-8 would be the obvious candidate) then that is not likely to happen. The actual difficulty here is that in some cases it will be hard to tell what the source encoding actually was.
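A minimal Python sketch of that translation step, assuming we only have a guess at the source encoding; the function name and the candidate list are illustrative, not an existing library API:

    def to_utf8(raw_bytes, declared=None):
        # Try the declared charset first, then a few common fallbacks.
        # Guessing the source is the hard part; the UTF-8 side can represent
        # the full Unicode repertoire, so nothing is lost on output.
        for enc in (declared, "utf-8", "cp1252", "iso-8859-1"):
            if not enc:
                continue
            try:
                return raw_bytes.decode(enc).encode("utf-8")
            except UnicodeDecodeError:
                continue  # ISO-8859-1 never fails, so we always return above.

    # ISO-8859-1 bytes for "naive" with an i-diaeresis come out as valid UTF-8.
    print(to_utf8(b"na\xefve", declared="iso-8859-1"))  # b'na\xc3\xafve'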
The simplest approach, assuming senders are unwilling or unable to stop sending non-ASCII, would be to shut down the digest and force everyone to subscribe to one message at a time.
That seems less simple and obviously worse than just keeping things as they are. (I do not claim that keeping things as they are is the best option.)
I won't mention the fact that this quoting is invariably in reverse order, since that doesn't affect the message size, merely the readability.
The quoting isn't invariably in reverse order; for instance, in this very discussion both you and I have quoted other people without reversing the order.
Also, lots of people send the body of their message -- including quoted text -- twice in each message, the second time in bloated HTML encoding, as if email were a web page.
No: as if HTML is able to represent simple typographical markup like italics and boldface and colour, and as if some people want those things in their email messages as they do in their web pages, printed documents, handwritten letters, etc.

If you receive 100 email messages a day with 10kB of bloat each, then every year the bloat costs you about 365MB of disc space (ignoring the possibility of compression -- redundant bloaty stuff compresses well). If you store them on a spinning-rust hard drive at a few cents per gigabyte, that will cost you about a cent or two per year. If you store them on a fancy high-end SSD, it might be more like 15 cents per year. Maybe 30 cents per year, if you want the absolute fastest gold-plated random access to your email archive.

You may well have a bigger volume of email than that, but if it's *much* bigger then I betcha the biggest contributors to its volume are not messages that could have been plain text but have been sent as HTML, but (1) messages that could have been plain text but have been sent as (say) Microsoft Word documents, and (2) messages with large attachments -- e.g., photographs or (hopefully non-malicious) computer software. These may be a blessing or a curse or both, but they have nothing to do with the merits of the math-fun digest.
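A back-of-the-envelope sketch of that arithmetic in Python; the per-gigabyte prices are rough 2015-era assumptions, not measurements:

    # 100 messages/day, 10 kB of HTML bloat each.
    messages_per_day = 100
    bloat_per_message_kb = 10

    bloat_per_year_gb = messages_per_day * bloat_per_message_kb * 365 / 1e6
    print(round(bloat_per_year_gb, 3), "GB per year")   # ~0.365 GB

    # Assumed storage prices, dollars per GB (circa 2015).
    hdd_per_gb, ssd_per_gb = 0.04, 0.40

    print(round(bloat_per_year_gb * hdd_per_gb * 100, 1), "cents/yr on spinning rust")
    print(round(bloat_per_year_gb * ssd_per_gb * 100, 1), "cents/yr on a high-end SSD")

-- g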