[QFJ-382] Foreign Language Support - Multibyte Characters - Chinese Created: 09/Dec/08  Updated: 02/Nov/15  Resolved: 09/Jun/14

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.3.3
Fix Version/s: 1.6.0

Type: Improvement Priority: Default
Reporter: Jason Aubrey Assignee: amichair
Resolution: Fixed Votes: 3
Labels: encoding
Environment:

All


Attachments: Zip Archive Changes.zip    
Issue Links:
Duplicate
duplicates QFJ-38 FIX Message support double-byte charset. Closed
is duplicated by QFJ-666 FIXMessageEncoder got BufferOverflowE... Closed
Relates
relates to QFJ-789 Fully support alternate encodings (ch... Open
is related to QFJ-631 Wrong checksum calculation in "quickf... Closed
is related to QFJ-282 FIXMessageEncoder#encode() may throws... Closed

 Description   

I need QFJ to support Chinese characters, so I modified my working copy to add this functionality along with tests. I could simply commit the changes, but I don't have write access to the repository, so I'll post the relevant changes here for now. It would be nice if I could simply attach all the diffs to this message.

Message.java
<pre>
 public String toString() {
-    header.setField(new BodyLength(bodyLength()));
+    try {
+        header.setField(new BodyLength(bodyLength()));
+    } catch (UnsupportedEncodingException e) {
+        LoggerFactory.getLogger(getClass()).error("toString failed, unsupported encoding", e);
+        return "";
+    }
     trailer.setField(new CheckSum(checkSum()));

     StringBuffer sb = new StringBuffer();
@@ -138,7 +145,7 @@
     return sb.toString();
 }

-public int bodyLength() {
+public int bodyLength() throws UnsupportedEncodingException {
     return header.calculateLength() + calculateLength() + trailer.calculateLength();
 }
</pre>

Field.java
<pre>
-/*package*/ int getLength() {
+/*package*/ int getLength() throws UnsupportedEncodingException {
     calculate();
-    return data.length() + 1;
+    return data.getBytes(CharsetSupport.getCharset()).length + 1;
 }
</pre>
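
The reason for the byte-based length above is that, for multibyte text, String.length() (which counts UTF-16 code units) and the encoded byte count diverge. A quick standalone sketch (my own illustration, not part of the patch) using the same Chinese test value as the tests below:

<pre>
public class LengthDemo {
    public static void main(String[] args) throws Exception {
        // "test data" in Chinese, the same value used in the FieldTest changes below
        String value = "\u6D4B\u9A8C\u6570\u636E";

        // 4 UTF-16 code units...
        System.out.println("chars: " + value.length());

        // ...but 12 bytes once encoded as UTF-8 (3 bytes per character here),
        // which is what BodyLength and CheckSum must be computed from
        System.out.println("UTF-8 bytes: " + value.getBytes("UTF-8").length);
    }
}
</pre>

That 12-byte figure is also why the FieldTest change below expects getLength() to return 16 for tag 13: 2 bytes for "13", 1 for "=", 12 for the value, and 1 for the trailing field delimiter.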

FieldTest.java
<pre>
-public void testFieldCalculations() {
+public void testFieldCalculationsEnglish() throws Exception {
     Field<String> object = new Field<String>(12, "VALUE");
     object.setObject("VALUE");
     assertEquals("12=VALUE", object.toString());
@@ -63,6 +65,22 @@
     assertEquals(544, object.getTotal());
     assertEquals(9, object.getLength());
 }
+
+public void testFieldCalculationsChinese() throws Exception {
+    try {
+        CharsetSupport.setCharset("UTF-8");
+        int tag = 13;
+        String value = "\u6D4B\u9A8C\u6570\u636E";
+        Field<String> object = new Field<String>(tag, value);
+        assertEquals(tag + "=" + value, object.toString());
+        assertEquals(119127, object.getTotal());
+        assertEquals(16, object.getLength());
+    } catch (Exception e) {
+        throw e;
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
+}
</pre>


FIXMessageEncoderTest.java
<pre>
 public void testWesternEuropeanEncoding() throws Exception {
-    // Default encoding, should work
-    doEncodingTest();
-
-    try {
-        // This will break because of European characters
-        CharsetSupport.setCharset("US-ASCII");
-        doEncodingTest();
-    } catch (ComparisonFailure e) {
-        // expected
-    } finally {
-        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
-    }
+    // äbcfödçé
+    String input = "\u00E4bcf\u00F6d\u00E7\u00E9";
+
+    // Default encoding, should work
+    doEncodingTest(input);
+
+    try {
+        // This will break because of European characters
+        CharsetSupport.setCharset("US-ASCII");
+        doEncodingTest(input);
+    } catch (ComparisonFailure e) {
+        // expected
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
 }

-private void doEncodingTest() throws ProtocolCodecException, UnsupportedEncodingException {
-    // äbcfödçé
-    String headline = "\u00E4bcf\u00F6d\u00E7\u00E9";
+public void testChineseEncoding() throws Exception {
+    // "test data" in Chinese
+    String input = "\u6D4B\u9A8C\u6570\u636E";
+
+    try {
+        // This will break because the characters cannot be represented properly
+        doEncodingTest(input);
+    } catch (ComparisonFailure e) {
+        // expected
+    }
+
+    try {
+        // This should work
+        CharsetSupport.setCharset("UTF-8");
+        doEncodingTest(input);
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
+}
+
+private void doEncodingTest(String input) throws ProtocolCodecException, UnsupportedEncodingException {
     News news = new News();
-    news.set(new Headline(headline));
+    news.set(new Headline(input));
     FIXMessageEncoder encoder = new FIXMessageEncoder();
     ProtocolEncoderOutputForTest encoderOut = new ProtocolEncoderOutputForTest();
     encoder.encode(null, news, encoderOut);
@@ -84,11 +105,24 @@
 }

-public void testEncodingString() throws Exception {
+public void testEncodingStringEnglish() throws Exception {
     FIXMessageEncoder encoder = new FIXMessageEncoder();
     ProtocolEncoderOutputForTest protocolEncoderOutputForTest = new ProtocolEncoderOutputForTest();
     encoder.encode(null, "abcd", protocolEncoderOutputForTest);
     assertEquals(4, protocolEncoderOutputForTest.buffer.limit());
 }
+
+public void testEncodingStringChinese() throws Exception {
+    FIXMessageEncoder encoder = new FIXMessageEncoder();
+    ProtocolEncoderOutputForTest protocolEncoderOutputForTest = new ProtocolEncoderOutputForTest();
+
+    try {
+        CharsetSupport.setCharset("UTF-8");
+        encoder.encode(null, "\u6D4B\u9A8C\u6570\u636E", protocolEncoderOutputForTest);
+    } finally {
+        CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+    }
+    assertEquals(12, protocolEncoderOutputForTest.buffer.limit());
+}
</pre>



 Comments   
Comment by Jason Aubrey [ 09/Dec/08 ]

The revision number of my working copy is 892 (was head revision last week at least).

Comment by Steve Bate [ 09/Dec/08 ]

Hi Jason,

Thanks for the patches. Have you verified that the checksum calculations work with these changes? The current calculation sums characters which are assumed to be 1-byte. This assumption is made to avoid the need to transcode the message string to bytes for the purpose of calculating the checksum.

Comment by Jason Aubrey [ 09/Dec/08 ]

Hi Steve,

I think there may have been some checksum related exceptions initially when sending multibyte characters due to how the buffer was allocated (based on character counts instead of byte count). However, I didn't modify the checksum code (shown below) since it still works in the same basic way.

<pre>
private int checkSum(String s) {
    int offset = s.lastIndexOf("\00110=");
    int sum = 0;
    for (int i = 0; i < offset; i++) {
        sum += s.charAt(i);
    }
    return (sum + 1) % 256;
}
</pre>

The only difference in behavior is that each character's value can be much larger than a simple ASCII value. For example, in UTF-8, "\u65E0\u6548\u7684\u7528", which is equivalent to "无效的用", has four characters whose values are each four hex digits long. So if each of these were FFFF, the sum would be 4 * FFFF = 3FFFC (262,140 in base 10). Given that the sum is stored as an integer, the only risk seems to be overflow, which would occur after 2,147,483,647. The overflow would therefore only occur after roughly 8,192 such four-character groups (i.e. 2,147,483,647 / 262,140), and this assumes each character is FFFF, which it would likely not be. I don't think this is a concern, though. If it were a concern, 'sum' could be stored as a larger type. I didn't give any thought to the '% 256' logic since I figured it's unique enough.
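
For comparison, here is a small standalone sketch (my own illustration, not QFJ code) that computes the sum both over the UTF-16 char values, as the method above does, and over the UTF-8 encoded bytes. The two resulting checksums generally differ for multibyte text, which is the discrepancy discussed in the next comment:

<pre>
public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        String s = "\u65E0\u6548\u7684\u7528"; // 无效的用

        int charSum = 0;
        for (int i = 0; i < s.length(); i++) {
            charSum += s.charAt(i);           // sums UTF-16 char values
        }

        int byteSum = 0;
        for (byte b : s.getBytes("UTF-8")) {
            byteSum += b & 0xFF;              // sums the bytes actually sent on the wire
        }

        // The two sums, and therefore the resulting mod-256 checksums, differ
        System.out.println("char-based: " + (charSum % 256));
        System.out.println("byte-based: " + (byteSum % 256));
    }
}
</pre>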

Comment by amichair [ 09/Jun/14 ]

The above analysis is incorrect, since the checksum should be performed on the encoded bytes, not the source (UTF-16) characters. btw, to avoid an overflow you can use '& 0xFF' instead of '% 256'.

In any case, this is now fixed - thanks for the patches, which helped along the way.

Currently setting a charset via CharsetSupport should work with any charset that is a superset of ASCII, which luckily is most of them.
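
To illustrate what the fix enables, here is a rough usage sketch (my own, not from the QFJ documentation; the session setup and sessionID are assumed to exist elsewhere in the application). The charset is set once, engine-wide, before sessions are started, and Chinese field values can then be sent as usual:

<pre>
import quickfix.CharsetSupport;
import quickfix.Session;
import quickfix.SessionID;
import quickfix.field.Headline;
import quickfix.fix44.News;

public class ChineseNewsExample {
    public static void send(SessionID sessionID) throws Exception {
        // Configure the engine-wide charset once, before any sessions are created.
        // Any charset that is a superset of ASCII should work; UTF-8 is the obvious choice.
        CharsetSupport.setCharset("UTF-8");

        // "test data" in Chinese, the same value as in the tests above
        News news = new News();
        news.set(new Headline("\u6D4B\u9A8C\u6570\u636E"));

        // sessionID identifies an already-established session (created elsewhere)
        Session.sendToTarget(news, sessionID);
    }
}
</pre>

Note that the charset is not negotiated on the wire, so both counterparties need to be configured consistently.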

Comment by Kou Jun [ 02/Nov/15 ]

Is there any sample code to send and receive Chinese characters?
It seems it still can't process Chinese characters properly!

Generated at Sat Apr 27 12:46:27 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.