GSS Script Tables Derivation

Processing Steps

Step 1
Inputs:
- in_supported-scripts.txt
- Scripts.txt
Outputs: set of characters for each script set
Operation: For each script set in in_supported-scripts.txt, read the code point ranges from Scripts.txt.

Step 2
Inputs: output from Step 1
Outputs:
- out_p1s2_script-set
- out_p1s2_script-set_rej
Operation: Filter out characters that are not letters or digits, and characters that are not NFKC outputs.

Step 3
Inputs:
- idnchars.txt
- out_p1s2_script-set
Outputs:
- out_p1s3_script-set
- out_p1s3_script-set_rej
Operation: Filter out characters not supported in IDNChars.

Step 4
Inputs:
- out_p1s3_script-set
- in_p1s4_script-set
Outputs:
- out_p1s4_script-set
- out_p1s4_script-set_rej
Operation: Intersect out_p1s3 with in_p1s4 (IDN language table) for the given script set.

Step 5
Inputs:
- out_p1s4_script-set
- in_p1s5_except_script-set
Outputs:
- out_p1s5_script-set
Operation: Add or remove characters as exceptions, e.g. add out-of-script code points to the given script set, including characters not supported in IDN language tables.

Step 6
Inputs:
- out_p1s5_script-set
- in_p1s6_script-set
Outputs:
- out_p1s6_script-set
Operation: Map canonicals: for every code point in out_p1s5, compute the canonical mapping from in_p1s6.

Step 7
Inputs:
- out_p1s6_script-set
Outputs:
- out_p1s7
Operation: Combine the mappings from all script sets into a single canonical mapping table.

Step 8
Inputs:
- out_p1s7
Outputs: various
Operation: Auxiliary step, sanity checking:
- every code point is mapped to its final target in a single step
- all source characters are NFKC
Advisory step: check against TR36 Confusables.txt.
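
The two sanity checks of Step 8 could be sketched roughly as follows. This is only an illustration: it assumes the combined table is held as a dict from source code point to target code point, which is a hypothetical representation (the actual layout of out_p1s7 is not described here), and the function names are made up for the example.

```python
import unicodedata

def check_single_step(mapping):
    """Return sources whose target is itself remapped to a different code point.

    `mapping` is a hypothetical dict {source_cp: target_cp}.
    """
    return sorted(
        src for src, dst in mapping.items()
        if dst in mapping and mapping[dst] != dst
    )

def check_sources_nfkc(mapping):
    """Return sources that NFKC normalization would change."""
    return sorted(
        src for src in mapping
        if unicodedata.normalize("NFKC", chr(src)) != chr(src)
    )
```

An empty list from both functions would mean the table passes both checks under these assumptions.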


Inputs

in_supported-scripts.txt

Supported "script sets". Each set contains one or more Unicode script names separated by '+'.
Source: manual
File: in_supported-scripts.txt
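
Reading the script sets might look like the sketch below. It assumes one script set per line and '#' comments, neither of which is stated above; only the '+' separator comes from the description. The function name is hypothetical.

```python
def parse_script_sets(text):
    """Split each non-empty line into its '+'-separated script names.

    Assumption: one script set per line, '#' starts a comment.
    """
    script_sets = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            script_sets.append([name.strip() for name in line.split("+")])
    return script_sets
```

For example, a line "hiragana+katakana+han" would yield one script set with three script names.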


Scripts.txt

Unicode standard Scripts.txt that lists the scripts and their ranges.
Source: www.unicode.org
File: UniData/Scripts.txt
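
A minimal sketch of reading code point ranges from Scripts.txt is given below. It relies on the published layout of that file (a single code point or a first..last range, then ';', then the script name, with '#' comments). The function names are hypothetical, and note that Scripts.txt capitalizes script names (e.g. "Latin"), so the lowercase names used in the file names here would need a case mapping that is not shown.

```python
def parse_scripts(lines):
    """Return {script_name: [(start, end), ...]} from Scripts.txt lines."""
    ranges = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        cp_field, script = (part.strip() for part in data.split(";", 1))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point
        ranges.setdefault(script, []).append((start, end))
    return ranges

def code_points(ranges, script_names):
    """Expand the ranges of every script in a script set into one set."""
    result = set()
    for name in script_names:
        for start, end in ranges.get(name, []):
            result.update(range(start, end + 1))
    return result
```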


idnchars.txt

Unicode TR36 restricted list of characters recommended for use in IDN.
Source: www.unicode.org
File: TR36/idnchars.txt


in_p1s4_script-set

IANA language table characters for the script set.
Source: manually edited language tables from the IANA IDN Language Tables Registry and CENTR
- http://www.iana.org/assignments/idn/registered.htm
- http://www.centr.org/docs/2003/11/centr-ga20-idncodepoints.pdf
Files:
- in_p1s4_latin.txt
- in_p1s4_hangul.txt
- in_p1s4_hiragana+katakana+han.txt


in_p1s5_except_script-set

Exceptions: characters to be added to or removed from the script set, e.g. out-of-script characters such as [a-z0-9] that should be added.
Source: manual
Files:
- in_p1s5_except_latin.txt
- in_p1s5_except_hangul.txt
- in_p1s5_except_hiragana+katakana+han.txt


in_p1s6_script-set

Canonical mappings of codepoints
Sources: Derived from in_p1s4_script-set, then

Files:
- in_p1s6_latin.txt.html
- in_p1s6_hangul.txt.html
- in_p1s6_hiragana+katakana+han.txt.html



Outputs

out_p1s2_script-set

Characters in the script set, restricted to "letters" and "digits" that are NFKC outputs.
Files:
- out_p1s2_latin.html
- out_p1s2_hangul.html
- out_p1s2_hiragana+katakana+han.html
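
The Step 2 filter could be approximated with Python's unicodedata module as sketched below. Two assumptions are made that the text does not confirm: "letters" and "digits" are taken to mean General Category L* and Nd, and "NFKC output" is taken to mean a character that NFKC normalization leaves unchanged. The function names are hypothetical.

```python
import unicodedata

def is_letter_or_digit(cp):
    """Assumption: letters are categories L*, digits are category Nd."""
    category = unicodedata.category(chr(cp))
    return category.startswith("L") or category == "Nd"

def is_nfkc_output(cp):
    """Assumption: a character is an NFKC output if NFKC leaves it unchanged."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch

def filter_step2(cps):
    """Split code points into (accepted, rejected) sets."""
    accepted, rejected = set(), set()
    for cp in cps:
        if is_letter_or_digit(cp) and is_nfkc_output(cp):
            accepted.add(cp)
        else:
            rejected.add(cp)
    return accepted, rejected
```

Under these assumptions, U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A) is rejected because NFKC folds it to U+0041, and U+002D (HYPHEN-MINUS) is rejected because it is punctuation.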


out_p1s2_script-set_rej

Characters in the script set rejected because they are neither letters nor digits, or because they are not NFKC outputs.
Files:
- out_p1s2_latin_rej.html
- out_p1s2_hangul_rej.html
- out_p1s2_hiragana+katakana+han_rej.html


out_p1s3_script-set

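Result of Boolean AND(out_p1s2_script-set, idnchars), i.e. the characters of out_p1s2_script-set that are also listed in idnchars.txt.
Files:
- out_p1s3_latin.html
- out_p1s3_hangul.html
- out_p1s3_hiragana+katakana+han.html


out_p1s3_script-set_rej

Result of out_p1s2_script-set - idnchars, i.e. the characters of out_p1s2_script-set that are not listed in idnchars.txt.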
Files:
- out_p1s3_latin_rej.html
- out_p1s3_hangul_rej.html
- out_p1s3_hiragana+katakana+han_rej.html
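
Following the Step 3 description (filter out characters not supported in IDNChars), the accepted and rejected sets can be sketched with plain set operations. The sketch assumes both inputs have already been loaded as sets of integer code points; the function name is hypothetical.

```python
def filter_step3(out_p1s2, idnchars):
    """Keep characters that idnchars supports; reject the rest."""
    accepted = out_p1s2 & idnchars   # supported in IDNChars
    rejected = out_p1s2 - idnchars   # filtered out
    return accepted, rejected
```

By construction, the two results are disjoint and together cover out_p1s2.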


out_p1s4_script-set

Result of Boolean AND(out_p1s3_script-set, in_p1s4_script-set)
Files:
- out_p1s4_latin.html
- out_p1s4_hangul.html
- out_p1s4_hiragana+katakana+han.html


out_p1s4_script-set_rej

Result of Boolean XOR(out_p1s3_script-set, in_p1s4_script-set)
Files:
- out_p1s4_latin_rej.html
- out_p1s4_hangul_rej.html
- out_p1s4_hiragana+katakana+han_rej.html
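
The AND and XOR results of Step 4 can be sketched as below, again assuming both inputs are sets of integer code points. Because XOR is a symmetric difference, the rejected set contains characters missing from either side; the sketch also splits it by origin, which is an addition for illustration and not something the files above are stated to do.

```python
def intersect_step4(out_p1s3, in_p1s4):
    """Return the AND and XOR of the candidate set and the language table."""
    accepted = out_p1s3 & in_p1s4            # Boolean AND
    rejected = out_p1s3 ^ in_p1s4            # Boolean XOR
    only_candidates = rejected & out_p1s3    # not in the language table
    only_language_table = rejected & in_p1s4 # in the table, dropped earlier
    return accepted, rejected, only_candidates, only_language_table
```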


out_p1s5_script-set

Result of Boolean OR(out_p1s4_script-set, in_p1s5_except_script-set)
Files:
- out_p1s5_latin.html
- out_p1s5_hangul.html
- out_p1s5_hiragana+katakana+han.html
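
Applying the exceptions might look like the following sketch. The format of the exception files is not described here, so the sketch simply assumes they have been loaded into two hypothetical sets, one of additions and one of removals.

```python
def apply_exceptions(out_p1s4, additions, removals):
    """Add exception characters, then drop the ones marked for removal.

    `additions` and `removals` are hypothetical sets of code points.
    """
    return (out_p1s4 | additions) - removals

# Example addition set: the out-of-script characters [a-z0-9].
ASCII_LETTERS_DIGITS = (
    set(range(ord("a"), ord("z") + 1)) | set(range(ord("0"), ord("9") + 1))
)
```

With no removals, this reduces to the Boolean OR of the two input sets.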


out_p1s6_script-set

Canonical mappings of code points in out_p1s5_script-set computed from in_p1s6_script-set.
Files:
- out_p1s6_latin.txt.html
- out_p1s6_hangul.txt.html
- out_p1s6_hiragana+katakana+han.txt.html
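
One way to picture the canonical mapping step is the sketch below. It assumes the canonical table has been loaded as a hypothetical dict from code point to canonical code point, and that a code point without an entry maps to itself; neither assumption is confirmed by the description above.

```python
def map_canonicals(out_p1s5, canon_table):
    """Return {code_point: canonical_code_point} for every accepted code point.

    Assumption: code points missing from `canon_table` are their own canonical.
    """
    return {cp: canon_table.get(cp, cp) for cp in sorted(out_p1s5)}
```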


out_p1s7

Combined canonical mapping table containing all code points from all supported script sets.
File: out_p1s7.txt.html
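
Combining the per-script-set mappings could be sketched as follows. The sketch assumes each script set's mapping is a hypothetical dict from source to target code point, and it records a conflict when two script sets map the same source to different targets. How the real process resolves such conflicts is not described here, so the sketch only reports them and keeps the first mapping seen.

```python
def combine_mappings(per_script_set):
    """Merge {script_set_name: {src: dst}} into one table plus a conflict list."""
    combined, conflicts = {}, []
    for name, mapping in per_script_set.items():
        for src, dst in mapping.items():
            if src in combined and combined[src] != dst:
                # Same source, different target: report it, keep the first.
                conflicts.append((src, combined[src], dst, name))
            else:
                combined[src] = dst
    return combined, conflicts
```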