Wait hold on I just realized.
-
@jannem Mmm, not sure about that. In my experience, “text encoding” and “language” are 2 orthogonal axes, and proper text handling requires you to know both.
This is one of the minor annoyances of Mastodon — it doesn't seem to be possible to mark parts of a post as being in different languages.
I don't have a huge problem with Han unification. I think it's a valid technical decision.
@krans @mcc
The bigger problem is that on the web and in apps there's usually no information about what language something is written in, which means a browser or an app can only guess which font to render Unicode Han characters in. And when a user has installed support for more than one language, it is certain to go wrong frequently. Edit: you don't need to know the language to always render "ä" correctly. You do need to know the language in order to render "骨".
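To make the "ä" versus "骨" point concrete: the character data itself carries no language information at all. A minimal Python sketch using only the standard `unicodedata` module (the helper name `describe` is made up for illustration):

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Each character is one code point. Nothing in it says which language
# the text is in, so a renderer has to get that from somewhere else.
print(describe("\u00e4"))  # ä
print(describe("\u9aa8"))  # 骨
```

Both lines print a code point and a name, and neither contains a language. For "ä" that does not matter; for "骨" the renderer still has to pick a regional glyph.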
-
@jannem I agree. The root cause is that file formats, protocols and most programs are written almost entirely by English-speakers, who assume that only English-speaking people use computers and that all content will be in English.
For my entire lifetime, support for multilingual text has always been an afterthought — and many development frameworks make it incredibly difficult.
-
Update: I solved the problem, not by adding Chinese as an alternate language for my Android, but by deleting Japanese as an alternate language. Not sure when I did this or what I was trying to accomplish but I question Google's decision that informing it I may look at text in Japanese makes it conclude I DEFINITELY won't be looking at Chinese!
@mcc unfortunately there’s not really a good solution to this problem and Android, like everyone else, just has to pick a resolution method and stick with it. If you’ve heard of “Han Unification,” well it sounds like something that happened violently in 2200 BC but actually it happened quite recently in a Unicode meeting room and it causes this exact specific intractable issue
-
@noone2333 thanks. Do I need a 个 or something on the 八?
@mcc You can just say 八人.
个 isn't needed here. It’s cleaner and more natural without it, especially in short, poetic or title-like phrases.
-
In Pleco they look like this. I don't know if this is a different but regular hanzi font or if the CJK unification is messing me up somehow
EDIT: I currently think Tusky is showing me Japanese character variants https://social.mildlyfunctional.gay/@artemist/116146010272716935
@mcc i once got homework graded as incorrect because the japanese dictionary website i used did not use "lang" html attributes and firefox ended up selecting a korean font
-
This is what Tusky looks like.
@mcc I wonder if android has an API to indicate what language a text field is in? Phanpy web (iOS) handles the character variation just fine and I wonder if it’s because browsers let you set languages for text + it’s using the annotated post language?
-
@mcc it seems like it is actually using the declared language on my end because if I switch the post language here to Japanese I see the Japanese variants of the characters, and if I switch it back to Chinese I see the Chinese variants of the characters.
Test post marked as Japanese: 八人入
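On the web, the declared language mentioned here is usually carried by the HTML `lang` attribute, which also works on a single span inside a larger text. A rough sketch of generating that markup in Python, assuming a hypothetical helper `lang_span` (not part of any Mastodon client); which font the browser then picks is up to the browser and the installed fonts:

```python
from html import escape


def lang_span(text: str, lang: str) -> str:
    """Wrap text in a <span> that declares its language (BCP 47 tag)."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


# The same three code points, declared once as Japanese and once as
# Simplified Chinese. Only the lang attribute differs.
print(lang_span("八人入", "ja"))
print(lang_span("八人入", "zh-Hans"))
```

The text bytes are identical in both spans, which is why the attribute is the only hint a renderer gets.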
-
@Heliograph @rk The 个 is a friend that you give to a number so that it does not get lonely
@mcc @Heliograph @rk my mind went immediately to Knuth up-arrow, which gives numbers lots of friends
-
WAIT WTF this is an actual Chinese IME and it seems to be showing me Japanese characters. Ok I think Lenovo is fucking with me, one minute
@mcc do you ever feel like the time between you making a lighthearted shitpost and then uncovering a pit of writhing software horrors gets shorter every year.
-
@mcc
This is the problem with Han unification; we're partway back to code pages and picking the right font to render a particular language. Like telling Danes and Swedes that ä and æ are the same character and so we'll just make them the same in Unicode.
-
@0xabad1dea @mcc also an act of violence, I would argue
-
@mcc I couldn’t find an Android API to do this right, and I found what seems like a reasonable iOS API but it doesn’t do what I was expecting so I’m not actually sure it’s possible to do this well except with web technologies
-
Wait hold on I just realized. Is
八人入
A reasonable Chinese sentence
@mcc you've been on Chinese fanfiction sites I see
-
@mcc @Heliograph @rk
It's a Totoro umbrella.
-
Okay I now believe the problem is neither Tusky nor Lenovo but rather that Android is not a serious product and never has been. It seems Android may outright refuse to show scripts unless you've whitelisted the language. Problem: I think this menu is asking me which version of Chinese I want but the menu is in Chinese. I want to look at Chinese text so I can learn Chinese. I don't know it yet. I feel like I'm playing an adventure game.
* I may explore a PR later anyway.
@mcc i think the first one is japanese, the second simplified chinese.
-
@porglezomp I'm talking to someone and we think we found the android API
-
@0xabad1dea @mcc I suppose the only actually reliable approach would be to store the IME locale per character or something so that it can be accurately rendered as it was written... or are these truly identical graphemes, and there's no chance of confusion in context? Even when people use multiple languages simultaneously?
(late edit after reading a lot more: ah, I see they DID just add a variant-selector character to effectively specify the locale... that seems a bit unlikely to gain major use, but technically I like it I guess)
Maybe one day we'll have UTF-8-2 and it'll just be infinitely extendable, rather than using a limited length prefix.
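For context on the "limited length prefix": in UTF-8 as currently specified, the first byte of a sequence announces how many bytes the sequence has, and sequences stop at four bytes. A small sketch of that rule in Python (the function name `utf8_sequence_length` is illustrative, not a standard library call):

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte."""
    if lead < 0x80:            # 0xxxxxxx: single-byte (ASCII)
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two bytes
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four bytes, capped at U+10FFFF
        return 4
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")


for ch in ("A", "\u00e4", "\u9aa8"):  # A, ä, 骨
    lead = ch.encode("utf-8")[0]
    print(f"U+{ord(ch):04X} lead=0x{lead:02X} length={utf8_sequence_length(lead)}")
```

Most CJK ideographs, including 骨, land in the three-byte range.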
-
@groxx @0xabad1dea There are various existing solutions but just because the solutions exist does not mean people follow them correctly
-
@mcc @0xabad1dea definitely agreed. even technically, it seems very unlikely to me that any IME is going to choose to, like, add variant selectors *to every single character* and confuse their users when it's blended with other text or in a size-limited scenario. those characters already take up a ton of space, making it worse won't go over well.
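The size concern is easy to put numbers on. Ideographic variation sequences append a selector from the Variation Selectors Supplement (U+E0100 to U+E01EF) after the base character, and each selector costs four bytes in UTF-8. A minimal sketch in Python; the pairing of 骨 with U+E0100 is only an illustration, and whether that exact sequence is registered is not checked here:

```python
VS17 = "\U000E0100"  # first code point in the Variation Selectors Supplement

base = "\u9aa8"              # 骨 on its own
with_selector = base + VS17  # base character followed by a selector

print(len(base.encode("utf-8")))           # 3 bytes
print(len(with_selector.encode("utf-8")))  # 7 bytes
print(len(with_selector))                  # 2 code points for one visible glyph
```

So tagging every character this way would more than double the encoded size, and any length limit that counts code points would see twice as many characters.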