| Step | Inputs | Outputs | Operation |
|---|---|---|---|
| 1 | in_supported-scripts.txt, Scripts.txt | Set of characters for each script set | For each script set in in_supported-scripts.txt, read the code point ranges. |
| 2 | Output from Step 1 | out_p1s2_script-set, out_p1s2_script-set_rej | Filter out non-letters, non-digits, and non-NFKC output characters. |
| 3 | idnchars.txt, out_p1s2_script-set | out_p1s3_script-set, out_p1s3_script-set_rej | Filter out characters not supported in IDNChars. |
| 4 | out_p1s3_script-set, in_p1s4_script-set | out_p1s4_script-set, out_p1s4_script-set_rej | Intersect out_p1s3 with in_p1s4 (IDN language table) for the given script set. |
| 5 | out_p1s4_script-set, in_p1s5_except_script-set | out_p1s5_script-set | Add / remove characters as exceptions, e.g. add out-of-script code points to the given script set, including characters not supported in IDN language tables. |
| 6 | out_p1s5_script-set, in_p1s6_script-set | out_p1s6_script-set | Map canonicals: for every code point in out_p1s5, compute the canonical mapping from in_p1s6. |
| 7 | out_p1s6_script-set | out_p1s7 | Combine the mappings from all script sets into a single canonical mapping table. |
| 8 | out_p1s7 | various | Auxiliary step (sanity checking): every code point is mapped to its final target in a single step; all source characters are NFKC. Advisory step: check against TR36 Confusables.txt. |

Supported "script sets". Each set contains one or more Unicode script names separated by '+'.
Source: manual
File: in_supported-scripts.txt
The Unicode Standard Scripts.txt file, which lists the scripts and their code point ranges.
Source: www.unicode.org
File: UniData/Scripts.txt
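Step 1 can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes the usual Scripts.txt data-line layout (`first..last ; Script # comment`), and the function names `parse_scripts` and `script_set` as well as the three sample lines are made up for the example.

```python
import re

# Matches data lines such as "0041..005A    ; Latin # L&  [26] ..."
# (assumed layout; comment and blank lines do not match and are skipped).
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def parse_scripts(lines):
    """Return {lowercased script name: set of code points}."""
    scripts = {}
    for line in lines:
        m = _LINE.match(line)
        if not m:
            continue
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)  # single code point if no range
        scripts.setdefault(m.group(3).lower(), set()).update(range(first, last + 1))
    return scripts

def script_set(scripts, name):
    """Union the scripts named in a '+'-separated script set."""
    chars = set()
    for part in name.split("+"):
        chars |= scripts[part.lower()]
    return chars

# Illustrative excerpt only, not the full file.
sample = [
    "# excerpt for illustration",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR",
    "3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE",
]
scripts = parse_scripts(sample)
latin = script_set(scripts, "latin")
latin_hiragana = script_set(scripts, "latin+hiragana")
```

Keeping each script as a plain `set` of integers makes the later steps simple set operations.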
Unicode TR36 restricted list of characters recommended for use in IDN
Source: www.unicode.org
File: TR36/idnchars.txt
IANA language table characters for the script set.
Source: manually edited language tables from the IANA IDN Language Tables Registry and CENTR
- http://www.iana.org/assignments/idn/registered.htm
- http://www.centr.org/docs/2003/11/centr-ga20-idncodepoints.pdf
Files:
- in_p1s4_latin.txt
- in_p1s4_hangul.txt
- in_p1s4_hiragana+katakana+han.txt
Exceptions: characters to be added to or removed from the script set, e.g. out-of-script characters such as [a-z0-9] that should be added.
Source: manual
Files:
- in_p1s5_except_latin.txt
- in_p1s5_except_hangul.txt
- in_p1s5_except_hiragana+katakana+han.txt
Canonical mappings of code points
Sources: derived from in_p1s4_script-set, then cn-chinese.html (IANA) plus manual editing
Characters in the script set that are letters or digits and are NFKC outputs.
Files:
- out_p1s2_latin.html
- out_p1s2_hangul.html
- out_p1s2_hiragana+katakana+han.html
Characters in the script set rejected because they are not letters or digits, or are not NFKC outputs.
Files:
- out_p1s2_latin_rej.html
- out_p1s2_hangul_rej.html
- out_p1s2_hiragana+katakana+han_rej.html
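The step 2 filter could look roughly like the sketch below. It rests on two assumptions that the text does not spell out: "letter" is taken to mean any Unicode general category starting with `L`, "digit" is taken to mean category `Nd`, and a character counts as an NFKC output when NFKC normalization leaves it unchanged. The function names are hypothetical.

```python
import unicodedata

def is_letter_or_digit(cp):
    # Assumption: letters are categories L*, digits are category Nd.
    cat = unicodedata.category(chr(cp))
    return cat.startswith("L") or cat == "Nd"

def is_nfkc_output(cp):
    # Assumption: an NFKC output is a character that NFKC maps to itself.
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) == ch

def filter_step2(code_points):
    """Split code points into (accepted, rejected) sets."""
    accepted, rejected = set(), set()
    for cp in code_points:
        if is_letter_or_digit(cp) and is_nfkc_output(cp):
            accepted.add(cp)
        else:
            rejected.add(cp)
    return accepted, rejected

# 'a', '9', '-', FULLWIDTH LATIN CAPITAL LETTER A, FEMININE ORDINAL INDICATOR
accepted, rejected = filter_step2({0x61, 0x39, 0x2D, 0xFF21, 0xAA})
```

Here the hyphen is rejected for its category, while the fullwidth A and the ordinal indicator are letters but are rejected because NFKC rewrites them.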
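Result of out_p1s2_script-set - (out_p1s2_script-set - idnchars), i.e. the characters of out_p1s2_script-set that are also in idnchars
Files:
- out_p1s3_latin.html
- out_p1s3_hangul.html
- out_p1s3_hiragana+katakana+han.html
Result of out_p1s2_script-set - idnchars, i.e. the characters of out_p1s2_script-set that are not in idnchars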
Files:
- out_p1s3_latin_rej.html
- out_p1s3_hangul_rej.html
- out_p1s3_hiragana+katakana+han_rej.html
Result of Boolean AND(out_p1s3_script-set, in_p1s4_script-set)
Files:
- out_p1s4_latin.html
- out_p1s4_hangul.html
- out_p1s4_hiragana+katakana+han.html
Result of Boolean XOR(out_p1s3_script-set, in_p1s4_script-set)
Files:
- out_p1s4_latin_rej.html
- out_p1s4_hangul_rej.html
- out_p1s4_hiragana+katakana+han_rej.html
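Steps 3 and 4 are plain set algebra once each file has been loaded as a set of code points. The sketch below assumes that representation; the function names and the sample values are invented for illustration. Step 3 follows the table's description (keep only characters supported in IDNChars).

```python
def step3(out_p1s2, idnchars):
    """Keep characters present in idnchars; report the rest as rejected."""
    accepted = out_p1s2 & idnchars
    rejected = out_p1s2 - idnchars
    return accepted, rejected

def step4(out_p1s3, in_p1s4):
    """Boolean AND gives the output; Boolean XOR gives the reject report."""
    accepted = out_p1s3 & in_p1s4
    report = out_p1s3 ^ in_p1s4
    return accepted, report

# Illustrative sets only.
out_p1s2 = {0x61, 0x62, 0xE9, 0x1C5}
idnchars = {0x61, 0x62, 0xE9}
in_p1s4 = {0x61, 0xE9, 0xF1}

s3, s3_rej = step3(out_p1s2, idnchars)
s4, s4_report = step4(s3, in_p1s4)
```

Note that XOR is symmetric: the step 4 report contains both the characters dropped from out_p1s3 and the language-table characters that never made it into out_p1s3.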
Result of Boolean OR(out_p1s4_script-set, in_p1s5_except_script-set)
Files:
- out_p1s5_latin.html
- out_p1s5_hangul.html
- out_p1s5_hiragana+katakana+han.html
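A possible shape for step 5 is sketched below. The format of the exception files is not described here, so the split into separate `additions` and `removals` sets is an assumption, as is the function name; the sample code points are illustrative.

```python
def apply_exceptions(out_p1s4, additions, removals=frozenset()):
    """Add exception characters (Boolean OR), then drop any removals."""
    return (set(out_p1s4) | set(additions)) - set(removals)

# Illustrative: two Hangul syllables plus the out-of-script range [a-z0-9].
hangul_s4 = {0xAC00, 0xAC01}
ascii_ld = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A))
out_p1s5 = apply_exceptions(hangul_s4, ascii_ld, removals={0xAC01})
```

Applying removals last means a character listed in both sets ends up removed; the opposite order would be an equally defensible choice.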
Canonical mappings of code points in out_p1s5_script-set computed from
in_p1s6_script-set.
Files:
- out_p1s6_latin.txt.html
- out_p1s6_hangul.txt.html
- out_p1s6_hiragana+katakana+han.txt.html
Combined canonical mapping table containing all code points from all supported script sets.
Files: out_p1s7.txt.html
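Combining the per-script-set mappings (step 7) and the sanity checks of step 8 might be sketched as follows. The sketch assumes each mapping is a `{source: target}` dictionary of code points, and it treats conflicting targets for the same source as an error, which is a design choice of this example rather than something the text specifies. The function names and sample mappings are hypothetical.

```python
import unicodedata

def combine(mappings):
    """Merge {source: target} dicts; raise if two tables disagree on a source."""
    combined = {}
    for mapping in mappings:
        for src, dst in mapping.items():
            if combined.setdefault(src, dst) != dst:
                raise ValueError(f"conflicting targets for U+{src:04X}")
    return combined

def sanity_check(mapping):
    """Return (chained, non_nfkc).

    chained:  sources whose target is itself mapped somewhere else, so the
              final target is not reached in a single step.
    non_nfkc: sources that NFKC normalization would change.
    """
    chained = {s for s, t in mapping.items() if mapping.get(t, t) != t}
    non_nfkc = {s for s in mapping
                if unicodedata.normalize("NFKC", chr(s)) != chr(s)}
    return chained, non_nfkc

# Illustrative mappings only.
latin = {0x41: 0x61, 0x61: 0x61}
han = {0x570B: 0x56FD, 0x56FD: 0x56FD}
table = combine([latin, han])
ok_chained, ok_non_nfkc = sanity_check(table)

# A deliberately broken table: U+0041 -> U+0042 -> U+0062 takes two steps.
bad_chained, _ = sanity_check({0x41: 0x42, 0x42: 0x62})
```

An empty `chained` set means every source reaches its final target directly, which is the property the auxiliary step checks.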