Some findings from studying surrogate pairs

HKSCS

Unicode surrogate programming with the Java language
On Unicode encoding: a brief explanation of UCS, UTF, BMP, BOM and related terms
HKSCS-2004 Support for Windows Platform
"HKSCS compatibility" – a Hong Kong character set compatibility scheme
Hong Kong Supplementary Character Set - 2004
HKSCS
HKSCS code table
Problems viewing Hong Kong characters on Win7
About the problems with using code page 951 on Windows 7
XML Services
Universal Character Set characters - Surrogates
Block-U20000-CJK-Unified-Ideographs-Extension-B
Unicode5.2.0
Introductory tips on character set encoding in Java
Chinese / HKSCS 2004 conversion script

Surrogate

Why UTF-32 instead of UTF-16 if we have surrogate pairs?
Handling Unicode surrogate values in Java strings
What is a surrogate pair in Java?
Introduction to character set encoding in Java (6): supplementary characters in Java

XMLBeans Related

Piccolo XML Parser for Java

Sample

00000010 00100100 01100101
10001011 10111001

28BB9
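
Assuming the value 28BB9 above is a hexadecimal code point, it lies beyond U+FFFF, so UTF-16 has to store it as a surrogate pair. Here is a minimal sketch of that arithmetic: subtract 0x10000, add the upper 10 bits to 0xD800 and the lower 10 bits to 0xDC00. The class and method names are made up for illustration and are not part of the original notes.

```java
// Illustrative helper, not part of the original notes.
final class SurrogateMath {
    // High (leading) surrogate of a supplementary code point.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate of a supplementary code point.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x28BB9; // assumed to be the sample code point above
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) high(cp), (int) low(cp));

        // Cross-check against the JDK's own conversion.
        char[] jdk = Character.toChars(cp);
        System.out.println(jdk[0] == high(cp) && jdk[1] == low(cp));
    }
}
```

Both helpers only make sense for code points in the range U+10000 to U+10FFFF; `Character.isSupplementaryCodePoint` can be used to check that first.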

Unicode Code Point Blocks - Code Charts

U0000: C0 Controls and Basic Latin
U0080: C1 Controls and Latin-1 Supplement
U0100: Latin Extended-A
U0180: Latin Extended-B
U0250: IPA Extensions
U02B0: Spacing Modifier Letters
U0300: Combining Diacritical Marks
U0370: Greek and Coptic
U0400: Cyrillic
U0500: Cyrillic Supplement
U0530: Armenian
U0590: Hebrew
U0600: Arabic
U0700: Syriac
U0750: Arabic Supplement
U0780: Thaana
U07C0: NKo
U0800: Samaritan
U0840: Mandaic
U08A0: Arabic Extended-A
U0900: Devanagari
U0980: Bengali
U0A00: Gurmukhi
U0A80: Gujarati
U0B00: Oriya
U0B80: Tamil
U0C00: Telugu
U0C80: Kannada
U0D00: Malayalam
U0D80: Sinhala
U0E00: Thai
U0E80: Lao
U0F00: Tibetan
U1000: Myanmar
U10A0: Georgian
U1100: Hangul Jamo
U1200: Ethiopic
U1380: Ethiopic Supplement
U13A0: Cherokee
U1400: Unified Canadian Aboriginal Syllabics
U1680: Ogham
U16A0: Runic
U1700: Tagalog
U1720: Hanunoo
U1740: Buhid
U1760: Tagbanwa
U1780: Khmer
U1800: Mongolian
U18B0: Unified Canadian Aboriginal Syllabics Extended
U1900: Limbu
U1950: Tai Le
U1980: New Tai Lue
U19E0: Khmer Symbols
U1A00: Buginese
U1A20: Tai Tham
U1B00: Balinese
U1B80: Sundanese
U1BC0: Batak
U1C00: Lepcha
U1C50: Ol Chiki
U1CC0: Sundanese Supplement
U1CD0: Vedic Extensions
U1D00: Phonetic Extensions
U1D80: Phonetic Extensions Supplement
U1DC0: Combining Diacritical Marks Supplement
U1E00: Latin Extended Additional
U1F00: Greek Extended
U2000: General Punctuation
U2070: Superscripts and Subscripts
U20A0: Currency Symbols
U20D0: Combining Diacritical Marks for Symbols
U2100: Letterlike Symbols
U2150: Number Forms
U2190: Arrows
U2200: Mathematical Operators
U2300: Miscellaneous Technical
U2400: Control Pictures
U2440: Optical Character Recognition
U2460: Enclosed Alphanumerics
U2500: Box Drawing
U2580: Block Elements
U25A0: Geometric Shapes
U2600: Miscellaneous Symbols
U2700: Dingbats
U27C0: Miscellaneous Mathematical Symbols-A
U27F0: Supplemental Arrows-A
U2800: Braille Patterns
U2900: Supplemental Arrows-B
U2980: Miscellaneous Mathematical Symbols-B
U2A00: Supplemental Mathematical Operators
U2B00: Miscellaneous Symbols and Arrows
U2C00: Glagolitic
U2C60: Latin Extended-C
U2C80: Coptic
U2D00: Georgian Supplement
U2D30: Tifinagh
U2D80: Ethiopic Extended
U2DE0: Cyrillic Extended-A
U2E00: Supplemental Punctuation
U2E80: CJK Radicals Supplement
U2F00: Kangxi Radicals
U2FF0: Ideographic Description Characters
U3000: CJK Symbols and Punctuation
U3040: Hiragana
U30A0: Katakana
U3100: Bopomofo
U3130: Hangul Compatibility Jamo
U3190: Kanbun
U31A0: Bopomofo Extended
U31C0: CJK Strokes
U31F0: Katakana Phonetic Extensions
U3200: Enclosed CJK Letters and Months
U3300: CJK Compatibility
U3400: CJK Unified Ideographs Extension A
U4DC0: Yijing Hexagram Symbols
U4E00: CJK Unified Ideographs
UA000: Yi Syllables
UA490: Yi Radicals
UA4D0: Lisu
UA500: Vai
UA640: Cyrillic Extended-B
UA6A0: Bamum
UA700: Modifier Tone Letters
UA720: Latin Extended-D
UA800: Syloti Nagri
UA830: Common Indic Number Forms
UA840: Phags-pa
UA880: Saurashtra
UA8E0: Devanagari Extended
UA900: Kayah Li
UA930: Rejang
UA960: Hangul Jamo Extended-A
UA980: Javanese
UAA00: Cham
UAA60: Myanmar Extended-A
UAA80: Tai Viet
UAAE0: Meetei Mayek Extensions
UAB00: Ethiopic Extended-A
UABC0: Meetei Mayek
UAC00: Hangul Syllables
UD7B0: Hangul Jamo Extended-B
UD800: High Surrogates
UDB80: High Private Use Surrogates
UDC00: Low Surrogates
UE000: Private Use Area
UF900: CJK Compatibility Ideographs
UFB00: Alphabetic Presentation Forms
UFB50: Arabic Presentation Forms-A
UFE00: Variation Selectors
UFE10: Vertical Forms
UFE20: Combining Half Marks
UFE30: CJK Compatibility Forms
UFE50: Small Form Variants
UFE70: Arabic Presentation Forms-B
UFF00: Halfwidth and Fullwidth Forms
UFFF0: Specials
U10000: Linear B Syllabary
U10080: Linear B Ideograms
U10100: Aegean Numbers
U10140: Ancient Greek Numbers
U10190: Ancient Symbols
U101D0: Phaistos Disc
U10280: Lycian
U102A0: Carian
U10300: Old Italic
U10330: Gothic
U10380: Ugaritic
U103A0: Old Persian
U10400: Deseret
U10450: Shavian
U10480: Osmanya
U10800: Cypriot Syllabary
U10840: Imperial Aramaic
U10900: Phoenician
U10920: Lydian
U10980: Meroitic Hieroglyphs
U109A0: Meroitic Cursive
U10A00: Kharoshthi
U10A60: Old South Arabian
U10B00: Avestan
U10B40: Inscriptional Parthian
U10B60: Inscriptional Pahlavi
U10C00: Old Turkic
U10E60: Rumi Numeral Symbols
U11000: Brahmi
U11080: Kaithi
U110D0: Sora Sompeng
U11100: Chakma
U11180: Sharada
U11680: Takri
U12000: Cuneiform
U12400: Cuneiform Numbers and Punctuation
U13000: Egyptian Hieroglyphs
U16800: Bamum Supplement
U16F00: Miao
U1B000: Kana Supplement
U1D000: Byzantine Musical Symbols
U1D100: Musical Symbols
U1D200: Ancient Greek Musical Notation
U1D300: Tai Xuan Jing Symbols
U1D360: Counting Rod Numerals
U1D400: Mathematical Alphanumeric Symbols
U1EE00: Arabic Mathematical Alphabetic Symbols
U1F000: Mahjong Tiles
U1F030: Domino Tiles
U1F0A0: Playing Cards
U1F100: Enclosed Alphanumeric Supplement
U1F200: Enclosed Ideographic Supplement
U1F300: Miscellaneous Symbols And Pictographs
U1F600: Emoticons
U1F680: Transport And Map Symbols
U1F700: Alchemical Symbols
U20000: CJK Unified Ideographs Extension B
U2A700: CJK Unified Ideographs Extension C
U2B740: CJK Unified Ideographs Extension D
U2F800: CJK Compatibility Ideographs Supplement
UE0000: Tags
UE0100: Variation Selectors Supplement
UF0000: Supplementary Private Use Area-A
U100000: Supplementary Private Use Area-B
package xx;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.math.BigDecimal;
import java.math.BigInteger;

import org.apache.xmlbeans.XmlException;
import org.openuri.easypo.LineItem;
import org.openuri.easypo.PurchaseOrderDocument;

public class Test {
    public static void main(String[] args) throws IOException, XmlException {

        // U+20008 written as a surrogate pair, followed by five BMP characters.
        String u = "\uD840\uDC08asasa";

        // length() counts UTF-16 code units, so the pair counts as 2.
        System.out.println(u + "+ " + u.length());
        System.out.println(Character.isHighSurrogate(u.charAt(0)));
        // A lone surrogate code unit is not a supplementary code point, so this prints false.
        System.out.println("isSupplementaryCodePoint " + Character.isSupplementaryCodePoint(u.charAt(0)));
        System.out.println((int) u.charAt(1));
        System.out.println("codePoint = > " + u.codePointAt(0) + ", 0x" + Integer.toHexString(u.codePointAt(0)));

        // codePointCount() counts the pair as a single character.
        System.out.println("@@@ " + u.codePointCount(0, u.length()));

        // Build a string from the supplementary code point U+200D9 and dump its code units.
        String s = String.valueOf(Character.toChars(0x200D9));
        System.out.println("<<< " + Character.isSupplementaryCodePoint(0x200D9));
        for (char c : s.toCharArray()) {
            System.out.format("BB => %x\n", (int) c);
        }

        System.out.format("hex -> decimal %s\n", 0x2F81A);

        // Decode the file through a Reader so that a 2-byte code unit
        // is never split across two read() calls.
        StringBuilder sb = new StringBuilder();
        Reader in = new InputStreamReader(new FileInputStream("C:/easypo.xml"), "UTF-16BE");
        try {
            char[] buffer = new char[4096];
            int length;
            while ((length = in.read(buffer)) != -1) {
                sb.append(buffer, 0, length);
            }
        } finally {
            in.close();
        }

        OutputStream os = new FileOutputStream("C:/out2.txt");
        // UTF-16BE byte order mark.
        os.write(0xfe);
        os.write(0xff);

        // UTF-8 byte order mark, for comparison:
        //os.write(0xef);
        //os.write(0xbb);
        //os.write(0xbf);

        // Surrogate pairs go to the file as raw UTF-16BE bytes; every other defined
        // character is collected in sb2. Index 0 is skipped (presumably the BOM of
        // the input file), and stopping at length() - 1 keeps charAt(i + 1) in range.
        StringBuilder sb2 = new StringBuilder();
        for (int i = 1; i < sb.length() - 1; i++) {
            char c = sb.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                char next = sb.charAt(i + 1);
                os.write(c >> 8);      // high byte of the first code unit
                os.write(c & 0xff);    // low byte of the first code unit
                os.write(next >> 8);   // high byte of the second code unit
                os.write(next & 0xff); // low byte of the second code unit
                i++;
            } else if (Character.isDefined(c)) {
                sb2.append(c);
            }
        }

        // Parse the XML with the surrogate pairs stripped out, then add the
        // supplementary character back as the description of a new line item.
        String updatedPoXml = addLineItem(sb2.toString(), s, "5", "20.00", "6");
        //System.out.println(new String(updatedPoXml.getBytes("UTF-8"), "UTF-8"));

        os.close();
    }

    private static String addLineItem(String purchaseOrder, String itemDescription, String perUnitOuncesString,
            String itemPriceString, String itemQuantityString) throws XmlException, IOException {
        PurchaseOrderDocument poDoc = PurchaseOrderDocument.Factory.parse(purchaseOrder);

        // Convert incoming data to types that can be used in accessors.
        BigDecimal perUnitOunces = new BigDecimal(perUnitOuncesString);
        BigDecimal itemPrice = new BigDecimal(itemPriceString);
        BigInteger itemQuantity = new BigInteger(itemQuantityString);

        // Add the new element.
        LineItem newItem = poDoc.getPurchaseOrder().addNewLineItem();
        newItem.setDescription(itemDescription);
        newItem.setPerUnitOunces(perUnitOunces);
        newItem.setPrice(itemPrice);
        newItem.setQuantity(itemQuantity);

        return poDoc.toString();
    }
}
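
The loop in the listing above writes each surrogate to the file as two big-endian bytes by hand. For comparison, here is a minimal sketch that lets the JDK's built-in charsets do the encoding. `Utf16Dump` and `toHex` are made-up names for illustration, not part of the original program.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: encode a supplementary character with the built-in charsets.
final class Utf16Dump {
    // Render bytes as space-separated upper-case hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x200D9));
        // UTF-16BE: one surrogate pair, four bytes, no byte order mark.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE)));
        // UTF-8: the same character also takes four bytes, but a different sequence.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The `UTF_16BE` charset does not emit a byte order mark, so code that wants one (as the listing does with `0xfe 0xff`) still has to write it separately.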

// Methods from a larger class (its declaration is not shown). They need
// java.io.ByteArrayOutputStream, java.io.IOException and java.util.HashMap.

// Replaces every surrogate pair in interfaceStr with a numbered placeholder
// (${0}, ${1}, ...) and stores the original pair in extBCharMap under that number.
private String processInterfaceStr(String interfaceStr, HashMap<String, String> extBCharMap) throws IOException {
    StringBuilder output = new StringBuilder();

    for (int i = 0; i < interfaceStr.length(); i++) {
        char c = interfaceStr.charAt(i);
        if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
            // The code units at i and i + 1 together form the pair.
            String surrogatePair = interfaceStr.substring(i, i + 2);
            i++;

            output.append("${" + extBCharMap.size() + "}");
            extBCharMap.put(String.valueOf(extBCharMap.size()), surrogatePair);
        } else if (Character.isDefined(c)) {
            output.append(c);
        }
    }

    return output.toString();
}

// Throws if any surrogate pairs are still left in the map, listing their code points.
private static void checkExtBCharMap(HashMap<String, String> extBCharMap) throws Exception {
    if (extBCharMap.size() > 0) {
        StringBuilder sb = new StringBuilder();
        for (String extBChar : extBCharMap.values()) {
            sb.append(showHexCode(extBChar));
        }
        throw new Exception("Surrogate pair characters are found in unexpected fields. " + sb.toString());
    }
}

// Formats the first code point of extBChar as "U+XXXXX ".
private static String showHexCode(String extBChar) {
    StringBuilder sb = new StringBuilder();

    String codepoint = "U+" + Integer.toHexString(extBChar.codePointAt(0)).toUpperCase();
    sb.append(codepoint);
    sb.append(" ");

    /*for (short i = 0; i < extBChar.length();i++) {
        String code = Long.toHexString(extBChar.charAt(i));

        sb.append("0x" + code);
        sb.append(" ");
    }*/

    return sb.toString();
}

// private method(s) -- END

public static void main(String[] args) throws Exception {
    HashMap<String, String> extBCharMap = new HashMap<String, String>();

    // Build the UTF-16BE bytes D8 40 DC D9 (the surrogate pair of U+200D9)
    // from their binary representation.
    ByteArrayOutputStream surrogatePair = new ByteArrayOutputStream();

    int a = Integer.parseInt("11011000", 2); // 0xD8
    System.out.println("a = " + a);
    int b = Integer.parseInt("01000000", 2); // 0x40

    surrogatePair.write(a);
    surrogatePair.write(b);

    a = Integer.parseInt("11011100", 2); // 0xDC
    b = Integer.parseInt("11011001", 2); // 0xD9

    surrogatePair.write(a);
    surrogatePair.write(b);
    surrogatePair.close();

    extBCharMap.put("1", surrogatePair.toString("UTF-16BE"));
    extBCharMap.put("2", surrogatePair.toString("UTF-16BE"));

    // replacePlaceHolder is not shown in this post.
    replacePlaceHolder(extBCharMap, "bb${1}uiuiu", "CASEAP", "CHI_SURNAME", "1235", "AP_CHI_FIRST");

    System.out.println("size of extBCharMap: " + extBCharMap.size());
    // Throws if any entries remain in the map.
    checkExtBCharMap(extBCharMap);
}
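
The methods above look for surrogate code units one `char` at a time. As a rough alternative sketch, the same check can be done by walking the string by code point, so a pair is always seen as one unit. `SupplementaryScan` and `findSupplementary` are hypothetical names and do not come from the original class.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: list the supplementary code points found in a string.
final class SupplementaryScan {
    static List<String> findSupplementary(String text) {
        List<String> found = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                found.add("U+" + Integer.toHexString(cp).toUpperCase());
            }
            // Advance by 2 code units for a surrogate pair, 1 otherwise.
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findSupplementary("bb\uD840\uDCD9uiuiu"));
    }
}
```

An unpaired surrogate is returned by `codePointAt` as is, so it is not reported as supplementary here; whether that is acceptable depends on how malformed input should be treated.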