QuickFIX/J

Foreign Language Support - Multibyte Characters - Chinese

Details

  • Type: Improvement
  • Status: Open
  • Priority: Default
  • Resolution: Unresolved
  • Affects Version/s: 1.3.3
  • Fix Version/s: Future Releases
  • Component/s: Message Generation
  • Labels:
    None
  • Environment:
    All

Description

I need QFJ to support Chinese characters, so I modified my working copy to add this functionality along with tests. I don't have write access to the repository, so I can't commit the changes myself; for now I'm posting the relevant diffs here. It would be nice to be able to add the diffs as attachments to this issue instead.

Message.java
<pre>
     public String toString() {
-        header.setField(new BodyLength(bodyLength()));
+        try {
+            header.setField(new BodyLength(bodyLength()));
+        } catch(UnsupportedEncodingException e) {
+            LoggerFactory.getLogger(getClass()).error("toString failed, unsupported encoding", e);
+            return "";
+        }
         trailer.setField(new CheckSum(checkSum()));
 
         StringBuffer sb = new StringBuffer();
@@ -138,7 +145,7 @@
         return sb.toString();
     }
 
-    public int bodyLength() {
+    public int bodyLength() throws UnsupportedEncodingException {
         return header.calculateLength() + calculateLength() + trailer.calculateLength();
     }
</pre>

Field.java
<pre>
-    /*package*/ int getLength() {
+    /*package*/ int getLength() throws UnsupportedEncodingException {
         calculate();
-        return data.length()+1;
+        return data.getBytes(CharsetSupport.getCharset()).length+1;
     }
</pre>
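
To illustrate why the length has to be computed from bytes rather than characters, here is a minimal standalone sketch. `FieldLengthSketch` and its methods are hypothetical names, not part of QuickFIX/J, and the sketch assumes the field data is the `tag=value` string followed by a one-byte SOH delimiter, which is consistent with the lengths asserted in the tests.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not QuickFIX/J code.
class FieldLengthSketch {

    /** Length in chars of "tag=value" plus one for the SOH delimiter (pre-patch style). */
    static int charLength(int tag, String value) {
        return (tag + "=" + value).length() + 1;
    }

    /** Length in UTF-8 bytes of "tag=value" plus one for the SOH delimiter (patched style). */
    static int utf8Length(int tag, String value) {
        return (tag + "=" + value).getBytes(StandardCharsets.UTF_8).length + 1;
    }

    public static void main(String[] args) {
        String chinese = "\u6D4B\u9A8C\u6570\u636E";
        System.out.println(charLength(13, chinese));  // 8: four chars for the value
        System.out.println(utf8Length(13, chinese));  // 16: three bytes per character
        System.out.println(utf8Length(12, "VALUE"));  // 9: ASCII is one byte per char
    }
}
```

For pure ASCII values the two counts agree, which is why the mismatch only shows up once multibyte text is sent.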

FieldTest.java
<pre>
-    public void testFieldCalculations() {
+    public void testFieldCalculationsEnglish() throws Exception {
         Field<String> object = new Field<String>(12, "VALUE");
         object.setObject("VALUE");
         assertEquals("12=VALUE", object.toString());
@@ -63,6 +65,22 @@
         assertEquals(544, object.getTotal());
         assertEquals(9, object.getLength());
     }
+
+    public void testFieldCalculationsChinese() throws Exception {
+        try {
+            CharsetSupport.setCharset("UTF-8");
+            int tag = 13;
+            String value = "\u6D4B\u9A8C\u6570\u636E";
+            Field<String> object = new Field<String>(tag, value);
+            assertEquals(tag + "=" + value, object.toString());
+            assertEquals(119127, object.getTotal());
+            assertEquals(16, object.getLength());
+        } catch(Exception e) {
+            throw e;
+        } finally {
+            CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+        }
+    }
</pre>
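
The expected total of 119127 in the Chinese test can be reproduced by hand. The sketch below assumes that `getTotal()` sums the UTF-16 char values of the `tag=value` string and adds 1 for the SOH delimiter; that assumption is inferred from the number in the test, not taken from the QuickFIX/J source. `FieldTotalSketch` and `charTotal` are hypothetical names.

```java
// Hypothetical illustration only; not QuickFIX/J code.
class FieldTotalSketch {

    /** Sum of the UTF-16 char values of "tag=value", plus 1 for the SOH delimiter. */
    static int charTotal(int tag, String value) {
        String field = tag + "=" + value;
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            sum += field.charAt(i);
        }
        return sum + 1; // SOH has the value 1
    }

    public static void main(String[] args) {
        // '1' + '3' + '=' contributes 161, the four Chinese chars 118965, SOH 1.
        System.out.println(charTotal(13, "\u6D4B\u9A8C\u6570\u636E")); // 119127
    }
}
```

Under this assumption the total is built from 16-bit char values while the length is built from encoded bytes, so the two numbers are no longer derived from the same representation of the field.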

FIXMessageEncoder.java

FIXMessageEncoderTest.java
<pre>

     public void testWesternEuropeanEncoding() throws Exception {
-        // Default encoding, should work
-        doEncodingTest();
-
-        try {
-            // This will break because of European characters
-            CharsetSupport.setCharset("US-ASCII");
-            doEncodingTest();
-        } catch (ComparisonFailure e) {
-            // expected
-        } finally {
-            CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
-        }
+        // äbcfödçé
+        String input = "\u00E4bcf\u00F6d\u00E7\u00E9";
+
+        // Default encoding, should work
+        doEncodingTest(input);
+
+        try {
+            // This will break because of European characters
+            CharsetSupport.setCharset("US-ASCII");
+            doEncodingTest(input);
+        } catch (ComparisonFailure e) {
+            // expected
+        } finally {
+            CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+        }
     }
 
-    private void doEncodingTest() throws ProtocolCodecException, UnsupportedEncodingException {
-        // äbcfödçé
-        String headline = "\u00E4bcf\u00F6d\u00E7\u00E9";
+    public void testChineseEncoding() throws Exception {
+        // "test data" in Chinese
+        String input = "\u6D4B\u9A8C\u6570\u636E";
+
+        try {
+            // This will break because the characters cannot be represented properly
+            doEncodingTest(input);
+        } catch (ComparisonFailure e) {
+            // expected
+        }
+
+        try {
+            // This should work
+            CharsetSupport.setCharset("UTF-8");
+            doEncodingTest(input);
+        } finally {
+            CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+        }
+    }
+
+    private void doEncodingTest(String input) throws ProtocolCodecException, UnsupportedEncodingException {
         News news = new News();
-        news.set(new Headline(headline));
+        news.set(new Headline(input));
         FIXMessageEncoder encoder = new FIXMessageEncoder();
         ProtocolEncoderOutputForTest encoderOut = new ProtocolEncoderOutputForTest();
         encoder.encode(null, news, encoderOut);
@@ -84,11 +105,24 @@
         }
     }
 
-    public void testEncodingString() throws Exception {
+    public void testEncodingStringEnglish() throws Exception {
         FIXMessageEncoder encoder = new FIXMessageEncoder();
         ProtocolEncoderOutputForTest protocolEncoderOutputForTest = new ProtocolEncoderOutputForTest();
         encoder.encode(null, "abcd", protocolEncoderOutputForTest);
         assertEquals(4, protocolEncoderOutputForTest.buffer.limit());
     }
+
+    public void testEncodingStringChinese() throws Exception {
+        FIXMessageEncoder encoder = new FIXMessageEncoder();
+        ProtocolEncoderOutputForTest protocolEncoderOutputForTest = new ProtocolEncoderOutputForTest();
+
+        try {
+            CharsetSupport.setCharset("UTF-8");
+            encoder.encode(null, "\u6D4B\u9A8C\u6570\u636E", protocolEncoderOutputForTest);
+        } finally {
+            CharsetSupport.setCharset(CharsetSupport.getDefaultCharset());
+        }
+        assertEquals(12, protocolEncoderOutputForTest.buffer.limit());
+    }
 
 }
</pre>
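
The "expected" comparison failures in these tests come from characters that the active charset cannot represent. The following standalone sketch shows the effect using only the JDK: `String.getBytes(Charset)` substitutes the charset's replacement byte (`?` for US-ASCII and ISO-8859-1) for unmappable characters. `EncodingRoundTripSketch` is a hypothetical name, and using ISO-8859-1 as a stand-in for the QuickFIX/J default charset is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not QuickFIX/J code.
class EncodingRoundTripSketch {

    /** Encodes the string with the given charset and decodes it again. */
    static String roundTrip(String s, Charset charset) {
        return new String(s.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String western = "\u00E4bcf\u00F6d\u00E7\u00E9";
        String chinese = "\u6D4B\u9A8C\u6570\u636E";

        // Accented Latin letters survive ISO-8859-1 but not US-ASCII.
        System.out.println(roundTrip(western, StandardCharsets.ISO_8859_1).equals(western)); // true
        System.out.println(roundTrip(western, StandardCharsets.US_ASCII));                   // ?bcf?d??

        // Chinese characters need a charset such as UTF-8.
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1));                 // ????
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));      // true
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length);                 // 12
    }
}
```

The last line matches the buffer limit of 12 asserted in `testEncodingStringChinese`.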

Issue Links

Activity

Jason Aubrey added a comment - 09/Dec/08 8:17 PM

The revision number of my working copy is 892 (was head revision last week at least).

Steve Bate added a comment - 09/Dec/08 9:18 PM

Hi Jason,

Thanks for the patches. Have you verified that the checksum calculations work with these changes? The current calculation sums characters which are assumed to be 1-byte. This assumption is made to avoid the need to transcode the message string to bytes for the purpose of calculating the checksum.
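
The concern about the 1-byte assumption can be made concrete with a small sketch. The FIX CheckSum is defined over the bytes on the wire (sum of bytes modulo 256), so a sum over Java chars only matches when every char encodes to a single byte of the same value. `ChecksumSketch`, `charChecksum`, and `byteChecksum` are hypothetical names and the code is not the QuickFIX/J implementation; it simply sums over a whole string for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not QuickFIX/J code.
class ChecksumSketch {

    /** Sums UTF-16 char values modulo 256 (no transcoding). */
    static int charChecksum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum % 256;
    }

    /** Sums the encoded bytes modulo 256, as a receiver reading the wire would. */
    static int byteChecksum(String s, Charset charset) {
        int sum = 0;
        for (byte b : s.getBytes(charset)) {
            sum += b & 0xFF; // treat each byte as unsigned
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        String ascii = "12=VALUE\u0001";
        String chinese = "13=\u6D4B\u9A8C\u6570\u636E\u0001";

        System.out.println(charChecksum(ascii));                             // 30
        System.out.println(byteChecksum(ascii, StandardCharsets.UTF_8));     // 30
        System.out.println(charChecksum(chinese));                           // 87
        System.out.println(byteChecksum(chinese, StandardCharsets.UTF_8));   // 51
    }
}
```

In this sketch the two methods agree for ASCII but disagree for the UTF-8 encoded Chinese field, which suggests a counterparty validating the checksum over received bytes could reject such a message.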

Jason Aubrey added a comment - 09/Dec/08 10:30 PM

Hi Steve,

I think there may have been some checksum-related exceptions initially when sending multibyte characters, due to how the buffer was allocated (based on character count instead of byte count). However, I didn't modify the checksum code (shown below) since it still works in the same basic way.

private int checkSum(String s) {
    int offset = s.lastIndexOf("\00110=");
    int sum = 0;
    for (int i = 0; i < offset; i++) {
        sum += s.charAt(i);
    }
    return (sum + 1) % 256;
}

The only difference in behavior is that each character's value can be much larger than a simple ASCII value. For example, "\u65E0\u6548\u7684\u7528", which is equivalent to "无效的用", has four characters, each a 16-bit Java char written as four hex digits. If each of these were FFFF, the sum would be 4 * FFFF = 3FFFC (262,140 in base 10). Given that the sum is stored as an integer, the only risk seems to be overflow, which would occur once the sum exceeds 2,147,483,647. Overflow would therefore only occur after 8,192 such four-character groups (i.e. 2,147,483,647 / 262,140), or roughly 32,768 characters, and this assumes each character is FFFF, which it would likely not be. I don't think this is a concern though. If it were a concern, 'sum' could be stored as a larger type. I didn't give any thought to the '% 256' logic since I figured it's unique enough.
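
As a rough check of the overflow arithmetic, the sketch below computes how many maximal 16-bit chars an `int` accumulator can absorb, and shows one way to sidestep the question entirely by keeping only the low 8 bits of the running sum. `OverflowSketch` and its methods are hypothetical names, not QuickFIX/J code, and the masking variant is a suggestion rather than something proposed in the patch.

```java
// Hypothetical illustration only; not QuickFIX/J code.
class OverflowSketch {

    static final int MAX_CHAR = 0xFFFF; // largest value a Java char can hold

    /** Number of maximal chars that fit in an int sum before it overflows. */
    static int maxCharsBeforeOverflow() {
        return Integer.MAX_VALUE / MAX_CHAR;
    }

    /** The same bound expressed in groups of four maximal chars. */
    static int maxFourCharGroupsBeforeOverflow() {
        return Integer.MAX_VALUE / (4 * MAX_CHAR);
    }

    /** Char sum modulo 256 that never grows beyond 8 bits, so it cannot overflow. */
    static int runningChecksum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum = (sum + s.charAt(i)) & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(maxCharsBeforeOverflow());          // 32768
        System.out.println(maxFourCharGroupsBeforeOverflow()); // 8192
        System.out.println(runningChecksum("13=\u6D4B\u9A8C\u6570\u636E\u0001")); // 87
    }
}
```

Masking on every step gives the same result as taking the full sum modulo 256, because 256 divides evenly into each partial reduction.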


People

Vote (1)
Watch (3)

Dates

  • Created:
    09/Dec/08 8:14 PM
    Updated:
    Thursday 5:03 PM