Hi Plex devs, thanks for the great work, and please make it even better with improved foreign-language support.
There have been some problems with the library’s index flags for foreign characters. I ignored the issue for a while, but a post from another user, @Spelopp, reminded me to report it.
I’m not sure whether it’s only CJK-related or whether all non-ASCII characters are affected, but I’d guess the latter.
Anyway, CJK characters have their own proper ways of handling index titles, because there are too many characters to give each one its own flag.
For example, in Korean, characters from ‘가’ (0xAC00) to ‘깋’ (0xAE4B) should be indexed under the ‘ㄱ’ (0x3131) or ‘가’ (0xAC00) flag, characters from ‘나’ (0xB098) to ‘닣’ (0xB2E3) should be indexed under the ‘ㄴ’ (0x3134) or ‘나’ (0xB098) flag, and so on. That way we would have only 19 Korean index flags, one per initial consonant, instead of the current seemingly random picks from 11172 possible flags.
*Here’s how it works.
Korean characters are displayed as a combination of ‘jamo’ letters, in the order ‘cho-seong’ (first sound, FS), ‘joong-seong’ (middle sound, MS), ‘jong-seong’ (end sound, ES).
The Unicode value of each character is determined as follows.
*index of ‘cho-seong’ / FS
| ㄱ | ㄲ | ㄴ | ㄷ | ㄸ | ㄹ | ㅁ | ㅂ | ㅃ | ㅅ | ㅆ | ㅇ | ㅈ | ㅉ | ㅊ | ㅋ | ㅌ | ㅍ | ㅎ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
*index of ‘joong-seong’ / MS
| ㅏ | ㅐ | ㅑ | ㅒ | ㅓ | ㅔ | ㅕ | ㅖ | ㅗ | ㅘ | ㅙ | ㅚ | ㅛ | ㅜ | ㅝ | ㅞ | ㅟ | ㅠ | ㅡ | ㅢ | ㅣ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
*index of ‘jong-seong’ / ES
| null | ㄱ | ㄲ | ㄳ | ㄴ | ㄵ | ㄶ | ㄷ | ㄹ | ㄺ | ㄻ | ㄼ | ㄽ | ㄾ | ㄿ | ㅀ | ㅁ | ㅂ | ㅄ | ㅅ | ㅆ | ㅇ | ㅈ | ㅊ | ㅋ | ㅌ | ㅍ | ㅎ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
unicode.value = 0xAC00 + [FS]*0x24C(588=21*28) + [MS]*0x1C(28) + [ES]
For example:
'글' = 'ㄱ' + 'ㅡ' + 'ㄹ' = 0xAC00 + [0]*0x24C + [18]*0x1C + [8] = 0xAE00
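
The formula above can be checked with a few lines of Python. This is only a sketch to illustrate the arithmetic; the names `compose` and `decompose` are made up for this post and have nothing to do with Plex's code.

```python
HANGUL_BASE = 0xAC00   # code point of '가'
FS_COUNT = 19          # number of cho-seong (first sounds)
MS_COUNT = 21          # number of joong-seong (middle sounds)
ES_COUNT = 28          # number of jong-seong (end sounds), including "null"


def compose(fs, ms, es=0):
    """Build a Hangul syllable from its FS / MS / ES table indices."""
    if not (0 <= fs < FS_COUNT and 0 <= ms < MS_COUNT and 0 <= es < ES_COUNT):
        raise ValueError("index out of range")
    return chr(HANGUL_BASE + fs * MS_COUNT * ES_COUNT + ms * ES_COUNT + es)


def decompose(ch):
    """Return the (FS, MS, ES) table indices of a Hangul syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < FS_COUNT * MS_COUNT * ES_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    fs, rest = divmod(offset, MS_COUNT * ES_COUNT)  # 588 syllables per FS
    ms, es = divmod(rest, ES_COUNT)                 # 28 syllables per MS
    return fs, ms, es


print(hex(ord(compose(0, 18, 8))))  # 'ㄱ' + 'ㅡ' + 'ㄹ'
print(decompose("깋"))
```

The first line prints `0xae00` (the value worked out for ‘글’ above), and the second prints `(0, 20, 27)`, the last syllable that starts with ‘ㄱ’.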
Every character that starts with FS ‘ㄱ’ falls in the range ‘가’ (= ‘ㄱ’ + ‘ㅏ’ + ‘null’ = 0xAC00) to ‘깋’ (= ‘ㄱ’ + ‘ㅣ’ + ‘ㅎ’ = 0xAE4B)
and should be grouped and indexed under ‘ㄱ’ (0x3131) or ‘가’ (0xAC00).
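
To show how small the fix could be, here is a rough sketch of the grouping in Python. I have no idea how Plex actually builds its index, so `korean_index_flag` and its fallback behaviour (return the first character uppercased, or `#` for an empty title) are purely my assumptions.

```python
# The 19 cho-seong in table order, written as compatibility jamo (U+3131 ...).
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

HANGUL_BASE = 0xAC00       # '가'
SYLLABLE_COUNT = 11172     # 19 * 21 * 28
PER_CHOSEONG = 588         # 21 * 28 syllables share each first sound


def korean_index_flag(title):
    """Return the index flag for a title (hypothetical helper)."""
    if not title:
        return "#"                      # assumed bucket for empty titles
    first = title[0]
    offset = ord(first) - HANGUL_BASE
    if 0 <= offset < SYLLABLE_COUNT:
        # Every syllable with the same first sound maps to one flag.
        return CHOSEONG[offset // PER_CHOSEONG]
    return first.upper()                # non-Hangul: assumed fallback


print(korean_index_flag("글"), korean_index_flag("나무"), korean_index_flag("plex"))
```

To use the ‘가’ style flags instead, return `chr(HANGUL_BASE + (offset // PER_CHOSEONG) * PER_CHOSEONG)` in place of the jamo lookup.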
Here’s why this has to be fixed.
Even in this small library, there are already several Korean index flags.
However, if it were flagged properly, the items indexed under ‘박’ and ‘버’ in the picture above would be indexed under ‘ㅂ’ or ‘바’,
and the items under ‘아’, ‘악’, ‘에’, ‘원’, ‘윤’ and ‘이’ would be indexed under ‘ㅇ’ or ‘아’.
That’s 8 index flags vs. 2 in this case, and 11172 vs. 19 in theory,
following a dictionary index scheme instead of random characters.
Isn’t it obvious that this has to be fixed for the index to be useful?
I don’t know how other Unicode characters are handled, but just as with Korean,
the Japanese title index should be grouped under 10 (or 11, maybe?) characters, not 50, as follows.
| | a | i | u | e | o |
|---|---|---|---|---|---|
| 1: あ / ∅ | あ | い | う | え | お |
| 2: か / K | か | き | く | け | こ |
| 3: さ / S | さ | し | す | せ | そ |
| 4: た / T | た | ち | つ | て | と |
| 5: な / N | な | に | ぬ | ね | の |
| 6: は / H | は | ひ | ふ | へ | ほ |
| 7: ま / M | ま | み | む | め | も |
| 8: や / Y | や | | ゆ | | よ |
| 9: ら / R | ら | り | る | れ | ろ |
| 10: わ / W | わ | ゐ | | ゑ | を |
| 11: ん / ng | ん | | | | |
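
The table above could be turned into a lookup roughly like this. Again, this is just a sketch: `japanese_index_flag` is a made-up name, and folding katakana and voiced kana (が, ぱ, ...) into the same rows is my own assumption, not something the table covers. Small kana and kanji are not handled and simply fall through to the first character.

```python
import unicodedata

# The rows of the table above, keyed by the character used as the index flag.
GOJUON_ROWS = {
    "あ": "あいうえお",
    "か": "かきくけこ",
    "さ": "さしすせそ",
    "た": "たちつてと",
    "な": "なにぬねの",
    "は": "はひふへほ",
    "ま": "まみむめも",
    "や": "やゆよ",
    "ら": "らりるれろ",
    "わ": "わゐゑを",
    "ん": "ん",
}
KANA_TO_FLAG = {kana: flag for flag, row in GOJUON_ROWS.items() for kana in row}


def japanese_index_flag(title):
    """Return the row flag for a kana title (hypothetical helper)."""
    if not title:
        return "#"                          # assumed bucket for empty titles
    first = title[0]
    # Assumption: katakana U+30A1..U+30F6 sit exactly 0x60 above hiragana.
    if "\u30a1" <= first <= "\u30f6":
        first = chr(ord(first) - 0x60)
    # NFD splits voiced kana into base kana + combining mark, e.g. が -> か.
    base = unicodedata.normalize("NFD", first)[0]
    return KANA_TO_FLAG.get(base, title[0])  # unknown characters fall through


print(japanese_index_flag("きつね"), japanese_index_flag("ガンダム"))
```

Both titles in the last line end up under the か flag, the first directly and the second after folding ガ to が and dropping the voicing mark.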
thanks.
