With ASCII æs the åriginal sin. Can’t even spell my name with that joke of an encoding >:(
To be fair, the American Standard Code for Information Interchange was only ever meant to represent English; it doesn’t care what language your name comes from.
It’s a “joke” because it comes from an era when memory was at a premium and, for better or worse, the English-speaking world was at the forefront of technology.
The fact that English has an alphabet of length just shy of a power of two probably helped spur on technological advancement that would have otherwise quickly been bogged down in trying to represent all the necessary glyphs and squeeze them into available RAM.
… Or ROM for that matter. In the ROM, you’d need bit patterns or vector lists that describe each and every character and that’s necessarily an order of magnitude bigger than what’s needed to store a value per glyph. ROM is an order of magnitude cheaper, but those two orders of magnitude basically cancel out and you have a ROM that costs as much to make as the RAM.
And when you look at ASCII’s contemporary EBCDIC, you’ll realise what a marvel ASCII is by comparison. Things could have been much, much worse.
It’s a joke because it includes useless letters nobody needs, like that weird o with the leg, and a rich set of field and record separating characters that are almost completely forgotten, etc, but not normal letters used in everyday language >:(
weird o with the leg
Can you elaborate? Do you mean Q or P?
Q. P is a common character across languages. But Q is mostly unused, at least outside the Romance languages, which appear to spell K that way. But that can be solved by letting the characters share the same code point and rendering it as K in most regions and as Q in France. I can’t imagine any problems arising from that. :)
While we’re at it, I have some other suggestions…
For example, in year 1 that useless letter “c” would be dropped to be replased either by “k” or “s,” and likewise “x” would no longer be part of the alphabet. The only kase in which “c” would be retained would be the “ch” formation, which will be dealt with later. year 2 might reform “w” spelling, so that “which” and “one” would take the same konsonant, wile year 3 might well abolish “y” replasing it with “i” and iear 4 might fiks the “g/j” anomali wonse and for all.
Jenerally, then, the improvement would kontinue iear bai iear with iear 5 doing awai with useless double konsonants, and iears 6-12 or so modifaiing vowlz and the rimeining voist and unvoist konsonants. Bai iear 15 or sou, it wud fainali bi posibl tu meik ius ov thi ridandant letez “c,” “y” and “x”–bai now jast a memori in the maindz ov ould doderez–tu riplais “ch,” “sh,” and “th” rispektivli.
Fainali, xen, aafte sam 20 iers ov orxogrefkl riform, wi wud hev a lojikl, kohirnt speling in ius xrewawt xe Ingliy-spiking werld.
Look into the Shavian alphabet.
Jess. Ai’m still lukking får the ekvivalent åv /r/JuropijenSpelling her ån lemmi. Fæntæstikk søbreddit vitsj æbsolutli nids lemmi representeysjen.
Haha, nicely done. I had to work harder and harder to read it.
If that’s a joke, it’s a good one. Otherwise, well, there are a lot of “this letter isn’t needed, let’s throw it away” proposals, and in most cases they won’t work as well as you think.
Yes, I am joking. We probably could do something like the old iso-646 or whatever it was that swapped letters depending on locale (or equivalent), but it’s not something we want to return to.
It’s also not something we’re entirely free of: Even though it’s mostly gone, apparently Bulgarian locales do something interesting with Cyrillic characters. cf https://tonsky.me/blog/unicode/
That is quite a unique quip. I love the idea of geo-based rendering: every application that renders text needs location access to be strictly correct :D.
I’d go further with the codepoint reduction, and delete w (can use uu) instead, and delete k (hard c can take its place).
To unjerk, as it were, it was a thing. So on old systems they’d do stuff like represent æøå with the same code points as {|}. Curly brace languages must have looked pretty weird back then :)
It still is a thing in some fonts: https://blog.miguelgrinberg.com/post/font-ligatures-for-your-code-editor-and-terminal
Took me a while to work out what they were called. Font rendering is hard :(
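If you want to see that ISO 646 trick in action, here’s a rough Python sketch. Python doesn’t ship a codec for the Danish/Norwegian national variant, so the substitution table below is hand-written from memory:

```python
# In US-ASCII, bytes 0x5B-0x5D and 0x7B-0x7D are [ \ ] and { | }.
# The Danish/Norwegian ISO 646 variant reused those slots for Æ Ø Å and æ ø å.
ISO646_DK_NO = {0x5B: "Æ", 0x5C: "Ø", 0x5D: "Å",
                0x7B: "æ", 0x7C: "ø", 0x7D: "å"}

def decode_iso646_dk_no(data: bytes) -> str:
    """Decode as ASCII, but apply the national-variant substitutions."""
    return "".join(ISO646_DK_NO.get(b, chr(b)) for b in data)

source = b"int main() { return 0; }"
print(source.decode("ascii"))       # int main() { return 0; }
print(decode_iso646_dk_no(source))  # int main() æ return 0; å
```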
Those “almost completely forgotten” characters were important when ASCII was invented, and a lot of that data is still around in some form or another. And since they’re there, they’re still available for the purpose they were designed for. You can be sure that someone would want to reinvent them if they weren’t already there.
Some operating systems did assign symbols to those characters anyway, MS-DOS being notable for this. Other standards also had code pages where different languages gave different meanings to the byte range beyond ASCII. One language’s code page might have “é” at a given byte value while another language’s put a completely different character there. This caused problems.
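To make that concrete, here’s a small Python sketch of one byte meaning three different things under three legacy code pages (0xE9 is just an arbitrary example value):

```python
# The same byte, decoded under three different legacy code pages.
raw = bytes([0xE9])

for codec in ("cp1252", "cp1251", "cp437"):
    print(codec, raw.decode(codec))
# cp1252 é   (Windows Western European)
# cp1251 й   (Windows Cyrillic)
# cp437  Θ   (the original IBM PC code page)
```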
Unicode is a superset of ASCII (the first 128 code points are identical) that covers all bases and has all the necessary symbols in fixed places.
That languages X, Y and Z don’t happen to have their alphabets in contiguous runs because they’re extended Latin is a problem, but not something that much can be done about.
It’s understandable that everyone would want their own alphabet at the base of the encoding, but some alphabet has to be, or you end up in code page hell again. English happened to get there first.
If you want a fun exercise (for various interpretations of “fun”), design your own standard. Do you put the digits 0-9 as code points 0-9 or do you start with your preferred alphabet there? What about upper and lower case? Which goes first? Where do you put Chinese?
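For reference, this is roughly where ASCII actually landed on those questions, and Unicode then froze those choices into its first 128 code points:

```python
# ASCII's answers: control characters first (0-31, including the field and
# record separators FS=0x1C through US=0x1F), digits at 0x30, upper case
# before lower case.
print(ord("0"), ord("9"))   # 48 57
print(ord("A"), ord("Z"))   # 65 90
print(ord("a"), ord("z"))   # 97 122

# Unicode keeps the same values, so plain ASCII text is already valid UTF-8;
# everything else gets multi-byte sequences.
print(ord("A") == 0x41)     # True
print("é".encode("utf-8"))  # b'\xc3\xa9'
```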
I’m not entirely sure here, but you are aware you’re in a humour community, yeah?
I see I’ve forgotten to put on my head net today. You know the one. Looks like a volleyball net. C shape. Attaches at the back. Catches things that go woosh.
Hey, now. Seven bits per character were good enough for Granddad, they should be good enough for you.
Are you being sarcastic? I can’t tell.
Yes, I’m being sarcastic, but I also think UTF-8 is plaintext these days. I really can’t spell my name in US-ASCII. As the other commenter here explained in more detail, it has its history, but it isn’t suited for today’s international computer users.
On the second day, he gave them css.
It’s just UTF-8
Android defaults to UTF-16
All because of Java
It’s also UTF-8 with BOM. It’s also Windows-1252 (Western). Don’t get me started on international date and time formatting and time assumptions :(
I wish it was just UTF-8
It’s also some surprise internal representation as utf-16; that’s at least still in the realm of Unicode. Would also expect there’s utf-32 still floating around somewhere, but I couldn’t tell you where.
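Just to show how many flavours of “plain text” one short string can have, a quick Python comparison (utf-8-sig is Python’s name for UTF-8 with a BOM, and the sample string is arbitrary):

```python
text = "blåbærsyltetøy"  # 14 characters, 3 of them outside ASCII

for codec in ("utf-8", "utf-8-sig", "utf-16", "utf-32", "cp1252"):
    data = text.encode(codec)
    print(f"{codec:9} {len(data):3} bytes  {data[:8].hex(' ')} ...")
# utf-8 is 17 bytes, utf-8-sig adds a 3-byte BOM, utf-16 and utf-32 start with
# their own BOMs and use 2 or 4 bytes per character, cp1252 squeezes it into 14.
```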
And is MySQL still doing that thing with utf8 as a noob trap and utf8_for_real_we_mean_it_this_time_honest or whatever they called it as normal UTF-8?
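They did, for years: MySQL’s utf8 is historically an alias for utf8mb3, which stores at most three bytes per character, and you need the utf8mb4 charset for real four-byte UTF-8. A quick Python illustration of why that bites:

```python
# Legacy MySQL "utf8" (utf8mb3) caps characters at 3 UTF-8 bytes, so anything
# outside the Basic Multilingual Plane (emoji, rarer CJK characters) needs utf8mb4.
for ch in ("é", "€", "🙂"):
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} bytes, fits in legacy utf8: {len(encoded) <= 3}")
# 'é': 2 bytes, fits in legacy utf8: True
# '€': 3 bytes, fits in legacy utf8: True
# '🙂': 4 bytes, fits in legacy utf8: False
```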
Me too. To this day our national electric invoice standard uses ISO-8859-15. And that’s just fine until somebody feels the need to have a look with Notepad, add a random space, and save the file.
Notepad then helpfully changes the encoding to UTF-16 and the whole batch errors out somewhere down the chain.
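That failure is easy to reproduce without Notepad. A sketch of what the consumer further down the chain sees once the declared encoding and the actual bytes disagree (file name and content are made up):

```python
# The invoice standard expects ISO-8859-15...
with open("invoice.txt", "w", encoding="iso-8859-15") as f:
    f.write("Beløb: 1 234,56 €\n")

# ...but the file gets re-saved as UTF-16, Notepad-style.
with open("invoice.txt", "r", encoding="iso-8859-15") as f:
    content = f.read()
with open("invoice.txt", "w", encoding="utf-16") as f:
    f.write(content)

# The next system in the chain still assumes ISO-8859-15 and gets a BOM
# followed by NUL-riddled mojibake instead of the invoice.
with open("invoice.txt", "r", encoding="iso-8859-15") as f:
    print(repr(f.read()))
```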
You’d think things would be simple by now, what with the existence of UTF-8.
And yet for the last 17 years, every company I’ve been in has had some sort of horrible mess involving Unicode and non-Unicode text, with nobody either recognising the problem or knowing how to solve it when they did recognise it (“well, the £ turns into a ?, so we just replace any ? in the filename with a £”).
Text encoding ‘standards’ were clearly the devil’s work, handed down to humanity to sow chaos and suffering.
In my experience things are fine while you work in a single environment, or you have control over the entire pipeline of data. Things quickly turn into a story from the Bible when different systems start trying to communicate.
Even with a single standard in a single project, things have a tendency to start breaking down as soon as there’s more than one developer and disagreement arises about what the text in the standard specification actually means.
That’s true yeah. The seed of all the problems is assuming.
My teammates assumed System.DefaultEncoding must be some default value (UTF-8, they assumed, again) that would carry across all servers so no worries. Except no, it’s “whatever encoding is configured on this machine as the default code page”.
Which was the same across our networks, lucky them.
But for this one machine, set up by an external contractor, which had UTF-8 as the default.
That one took me a while to track down…
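The same class of bug exists pretty much anywhere the platform default can leak in. Here’s the Python flavour of it, with a made-up file name; the principle is the same as with System.DefaultEncoding:

```python
import locale

# Whatever this particular machine is configured with, not a guaranteed UTF-8.
print(locale.getpreferredencoding())

# Relying on the default: the bytes written depend on the machine, and
# non-ASCII text can even raise UnicodeEncodeError under some locales.
with open("names.txt", "w") as f:
    f.write("Blåbærgrød\n")

# Spelling the encoding out removes the machine-to-machine surprise.
with open("names.txt", "w", encoding="utf-8") as f:
    f.write("Blåbærgrød\n")
```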
Get thee behind me, anything beyond extended ASCII.