QuickFIX/J

Explicitly control the String encoding of the FIX messages

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Major Major
  • Resolution: Fixed
  • Affects Version/s: 1.0.0 Final, 1.0.1, 1.0.2, 1.0.3
  • Fix Version/s: 1.2.0
  • Component/s: None
  • Labels:
    None

Description

This is currently and issue either with multibyte characters in FIX messages (somewhat rare, but it's been requested by a user) and with checksum problem with unintentional nonprinting characters in messages. The default encoding differs by platform so it's not a good idea to use the implicit default encoding.

Issue Links

Activity

Hide
Steve Bate added a comment - 22/Sep/06 3:16 PM

We'll need to modify both the FIX decoder and encoder.

Show
Steve Bate added a comment - 22/Sep/06 3:16 PM We'll need to modify both the FIX decoder and encoder.
Hide
Brad Harvey added a comment - 23/Sep/06 5:38 AM

I think the checksum validation/calculation should be in the decoder/encoder on the bytes just received/about to be sent. This makes it possible to correctly calculate the checksum on messages you receive which contain byte sequences that can't be mapped to your charset.

Show
Brad Harvey added a comment - 23/Sep/06 5:38 AM I think the checksum validation/calculation should be in the decoder/encoder on the bytes just received/about to be sent. This makes it possible to correctly calculate the checksum on messages you receive which contain byte sequences that can't be mapped to your charset.
Hide
Brad Harvey added a comment - 07/Oct/06 1:02 PM

I've attached some tests I was playing with to help me understand my checksum problem. I was surprised by the result of testDecodeDoubleByte - it passes in trunk - so I thought it was worth sharing. Most of the other tests aren't as interesting.

It seems that the change to FIXMessageDecoder.getMessageString to use the new String(byte[], charsetName) constructor has fixed the issue I was seeing with double bytes causing checksum failures in 1.0.2. From the javadoc:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.

It seems that the "unspecified" behaviour is to allow the invalid bytes through unchanged. What I found surprising was that there doesn't seem to be a way to emulate this behaviour using the Ignore/Replace/Report mechanism of CharsetDecoder. So hopefully the unspecified behaviour is consistent across JVMs (and alternate charsets?)!

Show
Brad Harvey added a comment - 07/Oct/06 1:02 PM I've attached some tests I was playing with to help me understand my checksum problem. I was surprised by the result of testDecodeDoubleByte - it passes in trunk - so I thought it was worth sharing. Most of the other tests aren't as interesting. It seems that the change to FIXMessageDecoder.getMessageString to use the new String(byte[], charsetName) constructor has fixed the issue I was seeing with double bytes causing checksum failures in 1.0.2. From the javadoc: The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required. It seems that the "unspecified" behaviour is to allow the invalid bytes through unchanged. What I found surprising was that there doesn't seem to be a way to emulate this behaviour using the Ignore/Replace/Report mechanism of CharsetDecoder. So hopefully the unspecified behaviour is consistent across JVMs (and alternate charsets?)!
Hide
Steve Bate added a comment - 10/Nov/06 4:09 AM

I'm moving this to a post 1.1 release. I can see what needs to be done but the trick will be finding a way to do that doesn't impact performance too negatively. The comments and tests have addressed the checksum and related unsigned byte issues but there are also issues with message length calculations when using MB character sets.

Show
Steve Bate added a comment - 10/Nov/06 4:09 AM I'm moving this to a post 1.1 release. I can see what needs to be done but the trick will be finding a way to do that doesn't impact performance too negatively. The comments and tests have addressed the checksum and related unsigned byte issues but there are also issues with message length calculations when using MB character sets.
Hide
Brad Harvey added a comment - 10/Nov/06 4:36 AM

When I looked at it I actually wondered if two options might be needed - the "old" one (that so far seems to do the job for most people) and a new one that handles double byte chars but may be slower. Having said that, I'd imagine choice of MessageStore would have a bigger performance impact.

Show
Brad Harvey added a comment - 10/Nov/06 4:36 AM When I looked at it I actually wondered if two options might be needed - the "old" one (that so far seems to do the job for most people) and a new one that handles double byte chars but may be slower. Having said that, I'd imagine choice of MessageStore would have a bigger performance impact.
Hide
Steve Bate added a comment - 10/Nov/06 5:05 AM

That's basically the approach I've been taking. One issue is that the way the body length and checksum is currently calculated, the specified charset would have to be pushed down to the field level and each field would need to do a character decoding to a byte array to determine it's length in bytes and it's contribution to the checksum. If I do that and I have the conditional to provide the current behavior if the default charset (US-ASCII) is being used, then it might be reasonable.

Show
Steve Bate added a comment - 10/Nov/06 5:05 AM That's basically the approach I've been taking. One issue is that the way the body length and checksum is currently calculated, the specified charset would have to be pushed down to the field level and each field would need to do a character decoding to a byte array to determine it's length in bytes and it's contribution to the checksum. If I do that and I have the conditional to provide the current behavior if the default charset (US-ASCII) is being used, then it might be reasonable.
Hide
Jörg Thönnes added a comment - 21/Mar/07 4:09 PM

We have issues with German and Italian character sets. These character sets use
values beyound the normal ASCII range (0..127) for QF/ 1.0.3.

After the log line for the incoming log, the error displayed is:

Mar 21, 2007 12:01:05 PM quickfix.mina.AbstractIoHandler messageReceived
SEVERE: Invalid message: Expected CheckSum=199, Received CheckSum=182

At the moment, we have no way to check automatically for this error. Is there anyway to
use a MINA error handler?

Show
Jörg Thönnes added a comment - 21/Mar/07 4:09 PM We have issues with German and Italian character sets. These character sets use values beyound the normal ASCII range (0..127) for QF/ 1.0.3. After the log line for the incoming log, the error displayed is: Mar 21, 2007 12:01:05 PM quickfix.mina.AbstractIoHandler messageReceived SEVERE: Invalid message: Expected CheckSum=199, Received CheckSum=182 At the moment, we have no way to check automatically for this error. Is there anyway to use a MINA error handler?
Hide
Jörg Thönnes added a comment - 31/Mar/07 12:16 AM

Looking at some CharsetDecoder examples, I am wondering whether setting the encoding/decoding char set to Latin-1
would help in our case.

Show
Jörg Thönnes added a comment - 31/Mar/07 12:16 AM Looking at some CharsetDecoder examples, I am wondering whether setting the encoding/decoding char set to Latin-1 would help in our case.
Hide
Jörg Thönnes added a comment - 31/Mar/07 11:58 PM

In the code, I found two places where the char set name "US-ASCII" is used.

To check the conversion of umlauts, I used the following snippet:

//String charSet = "US-ASCII";
//String charSet = "ISO_8859-1";
System.out.println(charSet);
String s = "äbc";
int offset = s.length();
int sum = 0;
for (int i = 0; i < offset; i++) {
int val = s.charAt;
System.out.print(val+"+");
sum+=val;
};
System.out.println( "Checksum: " + sum % 256 );
sum=0;
byte[] bb = s.getBytes( charSet );
sum=0; offset=bb.length;
for (int i = 0; i < offset; i++) {
int val = bb[i];
System.out.print(val+"+");
sum+=val;
}
System.out.println( "Checksum: " + sum % 256 );
System.out.println();

The output is

ISO_8859-1
228+98+99+Checksum: 169
-28+98+99+Checksum: 169

US-ASCII
228+98+99+Checksum: 169
63+98+99+Checksum: 4

So in our case, setting the char set name to "ISO_8859-1" would help.

That is, we need a char set configurable globally or per session.

Show
Jörg Thönnes added a comment - 31/Mar/07 11:58 PM In the code, I found two places where the char set name "US-ASCII" is used. To check the conversion of umlauts, I used the following snippet: //String charSet = "US-ASCII"; //String charSet = "ISO_8859-1"; System.out.println(charSet); String s = "äbc"; int offset = s.length(); int sum = 0; for (int i = 0; i < offset; i++) { int val = s.charAt; System.out.print(val+"+"); sum+=val; }; System.out.println( "Checksum: " + sum % 256 ); sum=0; byte[] bb = s.getBytes( charSet ); sum=0; offset=bb.length; for (int i = 0; i < offset; i++) { int val = bb[i]; System.out.print(val+"+"); sum+=val; } System.out.println( "Checksum: " + sum % 256 ); System.out.println(); The output is ISO_8859-1 228+98+99+Checksum: 169 -28+98+99+Checksum: 169 US-ASCII 228+98+99+Checksum: 169 63+98+99+Checksum: 4 So in our case, setting the char set name to "ISO_8859-1" would help. That is, we need a char set configurable globally or per session.
Hide
Steve Bate added a comment - 02/Apr/07 2:07 AM

Is the US-ASCII you saw used in the Multibyte branch? I couldn't find it in the trunk and I don't currently have a branch workspace checked out.

Show
Steve Bate added a comment - 02/Apr/07 2:07 AM Is the US-ASCII you saw used in the Multibyte branch? I couldn't find it in the trunk and I don't currently have a branch workspace checked out.
Hide
Jörg Thönnes added a comment - 02/Apr/07 7:08 AM

FIXMessageDecoder.java, SVN 555:
public FIXMessageDecoder() { this("US-ASCII"); }

FIXMessageEncoder.java, SVN 555:
public FIXMessageDecoder() { this("US-ASCII"); } }

Show
Jörg Thönnes added a comment - 02/Apr/07 7:08 AM FIXMessageDecoder.java, SVN 555: public FIXMessageDecoder() { this("US-ASCII"); } FIXMessageEncoder.java, SVN 555: public FIXMessageDecoder() { this("US-ASCII"); } }
Hide
Jörg Thönnes added a comment - 11/Apr/07 10:23 PM

This little Java program shows the current default charset:

public class ShowDefaultCharset {
public static void main( String[] args ) { System.out.println(java.nio.charset.Charset.defaultCharset().toString()); }
}

On Linux, the output depends on the setting of the LANG environment variable:

LANG=POSIX java -cp . ShowDefaultCharset
US-ASCII

LANG=en_US.iso88591 java -cp . ShowDefaultCharset
ISO-8859-1

LANG=de_DE.utf8 java -cp . ShowDefaultCharset
UTF-8

Since the ISO-8859-1 character sets covers the whole 8 bits of a byte, it should work well for most 1 byte non- ASCII charsets.

Show
Jörg Thönnes added a comment - 11/Apr/07 10:23 PM This little Java program shows the current default charset: public class ShowDefaultCharset { public static void main( String[] args ) { System.out.println(java.nio.charset.Charset.defaultCharset().toString()); } } On Linux, the output depends on the setting of the LANG environment variable: LANG=POSIX java -cp . ShowDefaultCharset US-ASCII LANG=en_US.iso88591 java -cp . ShowDefaultCharset ISO-8859-1 LANG=de_DE.utf8 java -cp . ShowDefaultCharset UTF-8 Since the ISO-8859-1 character sets covers the whole 8 bits of a byte, it should work well for most 1 byte non- ASCII charsets.
Hide
Jörg Thönnes added a comment - 11/Apr/07 10:46 PM

Here is an extended java program which also checks the byte to character mappings:

public class ShowDefaultCharset {
public static void main( String[] args ) {
System.out.println(java.nio.charset.Charset.defaultCharset().toString());

byte[] b = new byte[256];
for ( int i=0; i<256; i++ ) { b[i] = (byte)i; }
final String x = new String ( b );
System.out.println( "Looking for non 1:1 byte to character mappings..." );
for ( int i=0; i<x.length(); i++ ) {
if ( x.charAt( i ) != i ) { System.out.println( i + " -> " + (int)x.charAt ( i ) ); }
}
System.out.println("Done.");
}
}

Applying this program to ISO_8859-1 shows no change, ie this charset is really 1:1.

This means that the default character set for QuickFIX/J should be ISO_8859-1.

For multi-byte character sets, extra effort has to be done.

Show
Jörg Thönnes added a comment - 11/Apr/07 10:46 PM Here is an extended java program which also checks the byte to character mappings: public class ShowDefaultCharset { public static void main( String[] args ) { System.out.println(java.nio.charset.Charset.defaultCharset().toString()); byte[] b = new byte[256]; for ( int i=0; i<256; i++ ) { b[i] = (byte)i; } final String x = new String ( b ); System.out.println( "Looking for non 1:1 byte to character mappings..." ); for ( int i=0; i<x.length(); i++ ) { if ( x.charAt( i ) != i ) { System.out.println( i + " -> " + (int)x.charAt ( i ) ); } } System.out.println("Done."); } } Applying this program to ISO_8859-1 shows no change, ie this charset is really 1:1. This means that the default character set for QuickFIX/J should be ISO_8859-1. For multi-byte character sets, extra effort has to be done.
Hide
Jörg Thönnes added a comment - 11/Apr/07 10:56 PM

Another extension checks for the reverse direction. Due to the signed integer, I add 256 modulo 256 to get the positive value:

public class ShowDefaultCharset {
public static void main( String[] args ) {
System.out.println(java.nio.charset.Charset.defaultCharset().toString());

byte[] b = new byte[256];
for ( int i=0; i<256; i++ ) { b[i] = (byte)i; }
final String x = new String ( b );
final byte[] bx = x.getBytes();
System.out.println( "Looking for non 1:1 byte to character mappings..." );
for ( int i=0; i<x.length(); i++ ) {
if ( x.charAt( i ) != i ) { System.out.println( i + " -> " + (int)x.charAt ( i ) ); }
final int bb = (bx[i]+256) % 256;
if ( bb != i ) { System.out.println( i + " <- " + bb ); }
}
System.out.println("Done.");
}
}

For ISO_8859-1, this works fine.

Show
Jörg Thönnes added a comment - 11/Apr/07 10:56 PM Another extension checks for the reverse direction. Due to the signed integer, I add 256 modulo 256 to get the positive value: public class ShowDefaultCharset { public static void main( String[] args ) { System.out.println(java.nio.charset.Charset.defaultCharset().toString()); byte[] b = new byte[256]; for ( int i=0; i<256; i++ ) { b[i] = (byte)i; } final String x = new String ( b ); final byte[] bx = x.getBytes(); System.out.println( "Looking for non 1:1 byte to character mappings..." ); for ( int i=0; i<x.length(); i++ ) { if ( x.charAt( i ) != i ) { System.out.println( i + " -> " + (int)x.charAt ( i ) ); } final int bb = (bx[i]+256) % 256; if ( bb != i ) { System.out.println( i + " <- " + bb ); } } System.out.println("Done."); } } For ISO_8859-1, this works fine.
Hide
Jörg Thönnes added a comment - 11/Apr/07 11:10 PM

I suggest to make this patch to the 1.1.0 release to make the FIXMessageEncoder equivalent to the FIXMessageDecoder:

Index: /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java
===================================================================
— /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java (revision 617)
+++ /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java (working copy)
@@ -19,6 +19,7 @@

package quickfix.mina.message;

+import java.io.UnsupportedEncodingException;
import java.util.HashSet;
import java.util.Set;

@@ -50,6 +51,8 @@
return TYPES;
}

+ private String charsetName = "ISO_8859-1";
+
public void encode(IoSession session, Object message, ProtocolEncoderOutput out)
throws ProtocolCodecException { String fixMessageString; @@ -65,7 +68,11 @@ }

ByteBuffer buffer = ByteBuffer.allocate(fixMessageString.length());

  • buffer.put(fixMessageString.getBytes());
    + try { + buffer.put(fixMessageString.getBytes(charsetName)); + } catch (UnsupportedEncodingException e) { + throw new ProtocolCodecException( e ); + }
    buffer.flip();
    out.write(buffer);
    }

Steve, what do you think?

Show
Jörg Thönnes added a comment - 11/Apr/07 11:10 PM I suggest to make this patch to the 1.1.0 release to make the FIXMessageEncoder equivalent to the FIXMessageDecoder: Index: /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java =================================================================== — /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java (revision 617) +++ /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java (working copy) @@ -19,6 +19,7 @@ package quickfix.mina.message; +import java.io.UnsupportedEncodingException; import java.util.HashSet; import java.util.Set; @@ -50,6 +51,8 @@ return TYPES; } + private String charsetName = "ISO_8859-1"; + public void encode(IoSession session, Object message, ProtocolEncoderOutput out) throws ProtocolCodecException { String fixMessageString; @@ -65,7 +68,11 @@ } ByteBuffer buffer = ByteBuffer.allocate(fixMessageString.length());
  • buffer.put(fixMessageString.getBytes()); + try { + buffer.put(fixMessageString.getBytes(charsetName)); + } catch (UnsupportedEncodingException e) { + throw new ProtocolCodecException( e ); + } buffer.flip(); out.write(buffer); }
Steve, what do you think?
Hide
Jörg Thönnes added a comment - 11/Apr/07 11:39 PM

The suggested patch would complement the changes made with revision 527 in the FIXMessageDecoder.
For some reason, the equivalent change in the FIXMessageEncoder is missing.

Show
Jörg Thönnes added a comment - 11/Apr/07 11:39 PM The suggested patch would complement the changes made with revision 527 in the FIXMessageDecoder. For some reason, the equivalent change in the FIXMessageEncoder is missing.
Hide
Jörg Thönnes added a comment - 12/Apr/07 9:22 AM

I would also remove the setCharset methods in both FIXMessageEncoder/Decoder since
setting a Charset different from the ISO_8859-1 is not supported at the moment.

To allow other charsets, more work in the validation method has to be done.

Show
Jörg Thönnes added a comment - 12/Apr/07 9:22 AM I would also remove the setCharset methods in both FIXMessageEncoder/Decoder since setting a Charset different from the ISO_8859-1 is not supported at the moment. To allow other charsets, more work in the validation method has to be done.
Hide
Steve Bate added a comment - 12/Apr/07 2:04 PM

I originally was using US-ASCII (in the branch) because that's the FIX specification requires ASCII for nonencoded fields. However, I have no problem supporting other character sets (single byte, for now). Is there any reason why we wouldn't use UTF-8 instead of ISO_8859-1?

Show
Steve Bate added a comment - 12/Apr/07 2:04 PM I originally was using US-ASCII (in the branch) because that's the FIX specification requires ASCII for nonencoded fields. However, I have no problem supporting other character sets (single byte, for now). Is there any reason why we wouldn't use UTF-8 instead of ISO_8859-1?
Hide
Jörg Thönnes added a comment - 12/Apr/07 2:14 PM

Because ISO_8859-1 is the only charset with 1:1 mapping.
Just try my example program above with different LANG settings.

I tried US-ASCII, UTF-8 and ISO_8859-15, and every of these charsets mapping some of the bytes differently.

Therefore, exposing the setCharset() method makes sense as soon as the validate() method computes the checksums on the plain bytes.

My idea is to have a factory (FramingStrategyFactory) which returns a FramingStrategy for a given charset.
This would be the central entry point for the MINA en-/decoders and the validate() message to compute
both the checksum and the message length.

The FramingStrategy for ISO_8859-1 could simply take the String length and operate on the Java String directly, while single-byte strategies could take the String length, but compute the checksum on the bytes and multi-byte strategies also compute the length on the bytes.

The FIX Message constructor would have an optional charset argument, and if the message is sent down the link, a check is made whether the encoder charset is compatible to the Message charset. If not, the checksum and possibly the length are recomputed.

But I would promote this FramingStrategy stuff to a new JIRA issue.

Show
Jörg Thönnes added a comment - 12/Apr/07 2:14 PM Because ISO_8859-1 is the only charset with 1:1 mapping. Just try my example program above with different LANG settings. I tried US-ASCII, UTF-8 and ISO_8859-15, and every of these charsets mapping some of the bytes differently. Therefore, exposing the setCharset() method makes sense as soon as the validate() method computes the checksums on the plain bytes. My idea is to have a factory (FramingStrategyFactory) which returns a FramingStrategy for a given charset. This would be the central entry point for the MINA en-/decoders and the validate() message to compute both the checksum and the message length. The FramingStrategy for ISO_8859-1 could simply take the String length and operate on the Java String directly, while single-byte strategies could take the String length, but compute the checksum on the bytes and multi-byte strategies also compute the length on the bytes. The FIX Message constructor would have an optional charset argument, and if the message is sent down the link, a check is made whether the encoder charset is compatible to the Message charset. If not, the checksum and possibly the length are recomputed. But I would promote this FramingStrategy stuff to a new JIRA issue.
Hide
Jörg Thönnes added a comment - 12/Apr/07 2:15 PM

Today I noticed that outgoing message with national characters are encoded wrong on the first attempt,
but are re-coded correctly on resend. Will investigate further...

Show
Jörg Thönnes added a comment - 12/Apr/07 2:15 PM Today I noticed that outgoing message with national characters are encoded wrong on the first attempt, but are re-coded correctly on resend. Will investigate further...
Hide
Jörg Thönnes added a comment - 19/Apr/07 3:07 PM

OK, for outgoing messages, it currently works as follows:

1. Without PossDup=Y: The checksum is computed on the String directly and then forwarded to the encoder. The checksum is wrong.

2. Inside ResendRequest: The String is retrieved from MessageStore, where it has saved as byte[] array. In this way, the "bad" characters seem
to be replaced by "?". The checksum computed in the String domain now matched the checksum if it would have been computed on the bytes.

In summary, non ASCII characters cause exactly one resend round-trip for outgoing messages.

Show
Jörg Thönnes added a comment - 19/Apr/07 3:07 PM OK, for outgoing messages, it currently works as follows: 1. Without PossDup=Y: The checksum is computed on the String directly and then forwarded to the encoder. The checksum is wrong. 2. Inside ResendRequest: The String is retrieved from MessageStore, where it has saved as byte[] array. In this way, the "bad" characters seem to be replaced by "?". The checksum computed in the String domain now matched the checksum if it would have been computed on the bytes. In summary, non ASCII characters cause exactly one resend round-trip for outgoing messages.
Hide
Steve Bate added a comment - 25/May/07 5:34 PM

I checked in changes to allow the message encoding to be set on a JVM-wide basis. See the CharsetSupport and it's uses to see the specific changes. I apologize that I forgot to add the issue tag to the commit so they SVN changes aren't linked to this issue.

Show
Steve Bate added a comment - 25/May/07 5:34 PM I checked in changes to allow the message encoding to be set on a JVM-wide basis. See the CharsetSupport and it's uses to see the specific changes. I apologize that I forgot to add the issue tag to the commit so they SVN changes aren't linked to this issue.

People

Vote (1)
Watch (2)

Dates

  • Created:
    22/Sep/06 12:04 PM
    Updated:
    04/Jul/07 7:37 PM
    Resolved:
    25/May/07 5:34 PM