[QFJ-74] Explicitly control the String encoding of the FIX messages Created: 22/Sep/06  Updated: 09/Jun/14  Resolved: 25/May/07

Status: Closed
Project: QuickFIX/J
Component/s: None
Affects Version/s: 1.0.0 Final, 1.0.1, 1.0.2, 1.0.3
Fix Version/s: 1.2.0

Type: Bug Priority: Major
Reporter: Steve Bate Assignee: Steve Bate
Resolution: Fixed Votes: 1
Labels: None

Attachments: Java Source File ChecksumTest.java    
Issue Links:
Duplicate
duplicates QFJ-38 FIX Message support double-byte charset. Closed

 Description   

This is currently an issue both with multibyte characters in FIX messages (somewhat rare, but it has been requested by a user) and with checksum problems caused by unintentional nonprinting characters in messages. The default encoding differs by platform, so it's not a good idea to use the implicit default encoding.



 Comments   
Comment by Steve Bate [ 22/Sep/06 ]

We'll need to modify both the FIX decoder and encoder.

Comment by Brad Harvey [ 23/Sep/06 ]

I think the checksum validation/calculation should be in the decoder/encoder on the bytes just received/about to be sent. This makes it possible to correctly calculate the checksum on messages you receive which contain byte sequences that can't be mapped to your charset.
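For context, the FIX checksum is defined on the raw bytes of the wire image: the sum, modulo 256, of every byte of the message up to the CheckSum(10) field. A minimal sketch of that byte-level calculation (the class and method names are made up for illustration):

public class FixChecksum {
    // Sum of all bytes up to (not including) the "10=" tag, modulo 256;
    // defined on bytes, not on Java chars.
    static int checksum(byte[] wire, int end) {
        int sum = 0;
        for (int i = 0; i < end; i++) {
            sum += wire[i] & 0xFF; // unsigned byte value
        }
        return sum % 256;
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = "8=FIX.4.2\u00019=5\u000135=0\u0001".getBytes("ISO-8859-1");
        System.out.println(checksum(wire, wire.length));
    }
}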

Comment by Brad Harvey [ 07/Oct/06 ]

I've attached some tests I was playing with to help me understand my checksum problem. I was surprised by the result of testDecodeDoubleByte - it passes in trunk - so I thought it was worth sharing. Most of the other tests aren't as interesting.

It seems that the change to FIXMessageDecoder.getMessageString to use the new String(byte[], charsetName) constructor has fixed the issue I was seeing with double bytes causing checksum failures in 1.0.2. From the javadoc:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.

It seems that the "unspecified" behaviour is to allow the invalid bytes through unchanged. What I found surprising was that there doesn't seem to be a way to emulate this behaviour using the Ignore/Replace/Report mechanism of CharsetDecoder. So hopefully the unspecified behaviour is consistent across JVMs (and alternate charsets?)!

Comment by Steve Bate [ 10/Nov/06 ]

I'm moving this to a post-1.1 release. I can see what needs to be done, but the trick will be finding a way to do it that doesn't impact performance too negatively. The comments and tests have addressed the checksum and related unsigned byte issues, but there are also issues with message length calculations when using multibyte character sets.

Comment by Brad Harvey [ 10/Nov/06 ]

When I looked at it I actually wondered if two options might be needed - the "old" one (that so far seems to do the job for most people) and a new one that handles double byte chars but may be slower. Having said that, I'd imagine choice of MessageStore would have a bigger performance impact.

Comment by Steve Bate [ 10/Nov/06 ]

That's basically the approach I've been taking. One issue is that, given the way the body length and checksum are currently calculated, the specified charset would have to be pushed down to the field level, and each field would need to do a character encoding to a byte array to determine its length in bytes and its contribution to the checksum. If I do that and add a conditional to provide the current behavior when the default charset (US-ASCII) is being used, then it might be reasonable.
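A rough sketch of the per-field work being described (a hypothetical helper, not QuickFIX/J code): a field's wire form is tag=value followed by SOH, and both its byte length and its checksum contribution depend on the charset used to encode it.

import java.io.UnsupportedEncodingException;

public class FieldByteMath {
    // Hypothetical helper: returns {lengthInBytes, checksumContribution}
    // for a single tag=value<SOH> field encoded in the given charset.
    static int[] lengthAndChecksum(int tag, String value, String charsetName)
            throws UnsupportedEncodingException {
        byte[] bytes = (tag + "=" + value + '\001').getBytes(charsetName);
        int sum = 0;
        for (int i = 0; i < bytes.length; i++) {
            sum += bytes[i] & 0xFF; // unsigned byte value
        }
        return new int[] { bytes.length, sum };
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // For pure US-ASCII values the byte length equals the String length,
        // so the existing String-based code could remain as a fast path.
        int[] r = lengthAndChecksum(58, "äbc", "ISO-8859-1");
        System.out.println("bytes=" + r[0] + ", checksum contribution=" + r[1]);
    }
}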

Comment by Jörg Thönnes [ 21/Mar/07 ]

We have issues with German and Italian character sets with QF/J 1.0.3. These character sets use values beyond the normal ASCII range (0..127).

After the log line for the incoming message, the error displayed is:

Mar 21, 2007 12:01:05 PM quickfix.mina.AbstractIoHandler messageReceived
SEVERE: Invalid message: Expected CheckSum=199, Received CheckSum=182

At the moment, we have no way to check automatically for this error. Is there any way to use a MINA error handler?

Comment by Jörg Thönnes [ 31/Mar/07 ]

Looking at some CharsetDecoder examples, I am wondering whether setting the encoding/decoding char set to Latin-1
would help in our case.

Comment by Jörg Thönnes [ 31/Mar/07 ]

In the code, I found two places where the char set name "US-ASCII" is used.

To check the conversion of umlauts, I used the following snippet:

// Run once with each charset to reproduce the output below:
// String charSet = "US-ASCII";
String charSet = "ISO_8859-1";
System.out.println(charSet);
String s = "äbc";

// Checksum over the Java chars (as computed on the String)
int sum = 0;
for (int i = 0; i < s.length(); i++) {
    int val = s.charAt(i);
    System.out.print(val + "+");
    sum += val;
}
System.out.println("Checksum: " + sum % 256);

// Checksum over the encoded bytes (as they appear on the wire);
// note that getBytes(String) throws UnsupportedEncodingException
byte[] bb = s.getBytes(charSet);
sum = 0;
for (int i = 0; i < bb.length; i++) {
    int val = bb[i];
    System.out.print(val + "+");
    sum += val;
}
System.out.println("Checksum: " + sum % 256);
System.out.println();

The output, running once with each charset, is:

ISO_8859-1
228+98+99+Checksum: 169
-28+98+99+Checksum: 169

US-ASCII
228+98+99+Checksum: 169
63+98+99+Checksum: 4

So in our case, setting the charset name to "ISO_8859-1" would help.

That is, we need the charset to be configurable globally or per session.

Comment by Steve Bate [ 02/Apr/07 ]

Is the US-ASCII you saw used in the Multibyte branch? I couldn't find it in the trunk and I don't currently have a branch workspace checked out.

Comment by Jörg Thönnes [ 02/Apr/07 ]

FIXMessageDecoder.java, SVN 555:

public FIXMessageDecoder() { this("US-ASCII"); }

FIXMessageEncoder.java, SVN 555:

public FIXMessageEncoder() { this("US-ASCII"); }
Comment by Jörg Thönnes [ 11/Apr/07 ]

This little Java program shows the current default charset:

public class ShowDefaultCharset {
    public static void main(String[] args) {
        System.out.println(java.nio.charset.Charset.defaultCharset().toString());
    }
}

On Linux, the output depends on the setting of the LANG environment variable:

LANG=POSIX java -cp . ShowDefaultCharset
US-ASCII

LANG=en_US.iso88591 java -cp . ShowDefaultCharset
ISO-8859-1

LANG=de_DE.utf8 java -cp . ShowDefaultCharset
UTF-8

Since the ISO-8859-1 character set covers all 8 bits of a byte, it should work well for most single-byte non-ASCII charsets.

Comment by Jörg Thönnes [ 11/Apr/07 ]

Here is an extended java program which also checks the byte to character mappings:

public class ShowDefaultCharset {
    public static void main(String[] args) {
        System.out.println(java.nio.charset.Charset.defaultCharset().toString());

        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++) {
            b[i] = (byte) i;
        }

        final String x = new String(b);
        System.out.println("Looking for non 1:1 byte to character mappings...");
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != i) {
                System.out.println(i + " -> " + (int) x.charAt(i));
            }
        }
        System.out.println("Done.");
    }
}

Applying this program to ISO_8859-1 shows no change, i.e. this charset really maps 1:1.

This means that the default character set for QuickFIX/J should be ISO_8859-1.

For multi-byte character sets, extra effort is required.

Comment by Jörg Thönnes [ 11/Apr/07 ]

Another extension checks the reverse direction as well. Because Java bytes are signed, I add 256 and take the result modulo 256 to get the positive value:

public class ShowDefaultCharset {
    public static void main(String[] args) {
        System.out.println(java.nio.charset.Charset.defaultCharset().toString());

        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++) {
            b[i] = (byte) i;
        }

        final String x = new String(b);
        final byte[] bx = x.getBytes();
        System.out.println("Looking for non 1:1 byte to character mappings...");
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != i) {
                System.out.println(i + " -> " + (int) x.charAt(i));
            }
            final int bb = (bx[i] + 256) % 256;
            if (bb != i) {
                System.out.println(i + " <- " + bb);
            }
        }
        System.out.println("Done.");
    }
}

For ISO_8859-1, this works fine.

Comment by Jörg Thönnes [ 11/Apr/07 ]

I suggest applying this patch to the 1.1.0 release to make the FIXMessageEncoder equivalent to the FIXMessageDecoder:

Index: /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java
===================================================================
--- /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java (revision 617)
+++ /export/home/joerg/workspace/quickfixj/core/src/main/java/quickfix/mina/message/FIXMessageEncoder.java (working copy)
@@ -19,6 +19,7 @@
 
 package quickfix.mina.message;
 
+import java.io.UnsupportedEncodingException;
 import java.util.HashSet;
 import java.util.Set;
 
@@ -50,6 +51,8 @@
         return TYPES;
     }
 
+    private String charsetName = "ISO_8859-1";
+
     public void encode(IoSession session, Object message, ProtocolEncoderOutput out)
             throws ProtocolCodecException {
         String fixMessageString;
@@ -65,7 +68,11 @@
         }
 
         ByteBuffer buffer = ByteBuffer.allocate(fixMessageString.length());
-        buffer.put(fixMessageString.getBytes());
+        try {
+            buffer.put(fixMessageString.getBytes(charsetName));
+        } catch (UnsupportedEncodingException e) {
+            throw new ProtocolCodecException(e);
+        }
         buffer.flip();
         out.write(buffer);
     }

Steve, what do you think?

Comment by Jörg Thönnes [ 11/Apr/07 ]

The suggested patch would complement the changes made with revision 527 in the FIXMessageDecoder.
For some reason, the equivalent change in the FIXMessageEncoder is missing.

Comment by Jörg Thönnes [ 12/Apr/07 ]

I would also remove the setCharset methods in both FIXMessageEncoder and FIXMessageDecoder, since setting a charset other than ISO_8859-1 is not supported at the moment.

To allow other charsets, more work has to be done in the validation method.

Comment by Steve Bate [ 12/Apr/07 ]

I originally used US-ASCII (in the branch) because the FIX specification requires ASCII for non-encoded fields. However, I have no problem supporting other character sets (single-byte, for now). Is there any reason why we wouldn't use UTF-8 instead of ISO_8859-1?

Comment by Jörg Thönnes [ 12/Apr/07 ]

Because ISO_8859-1 is the only charset with a 1:1 byte-to-character mapping.
Just try my example program above with different LANG settings.

I tried US-ASCII, UTF-8 and ISO_8859-15, and each of these charsets maps some of the bytes differently.

Therefore, exposing the setCharset() method makes sense as soon as the validate() method computes the checksums on the plain bytes.

My idea is to have a factory (FramingStrategyFactory) which returns a FramingStrategy for a given charset. This would be the central entry point for the MINA en-/decoders and for the validate() method to compute both the checksum and the message length.

The FramingStrategy for ISO_8859-1 could simply take the String length and operate on the Java String directly; other single-byte strategies could take the String length but compute the checksum on the bytes; multi-byte strategies would also compute the length on the bytes.

The FIX Message constructor would have an optional charset argument, and when the message is sent down the link, a check is made whether the encoder charset is compatible with the Message charset. If not, the checksum and possibly the length are recomputed.

But I would promote this FramingStrategy stuff to a new JIRA issue.
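A hypothetical sketch of what such a strategy could look like (none of these types exist in QuickFIX/J; the names and signatures are illustrative only):

import java.io.UnsupportedEncodingException;

interface FramingStrategy {
    int bodyLengthBytes(String message) throws UnsupportedEncodingException;
    int checksum(String message) throws UnsupportedEncodingException;
}

class ByteBasedFramingStrategy implements FramingStrategy {
    private final String charsetName;

    ByteBasedFramingStrategy(String charsetName) {
        this.charsetName = charsetName;
    }

    public int bodyLengthBytes(String message) throws UnsupportedEncodingException {
        return message.getBytes(charsetName).length;
    }

    public int checksum(String message) throws UnsupportedEncodingException {
        int sum = 0;
        for (byte b : message.getBytes(charsetName)) {
            sum += b & 0xFF; // unsigned byte value
        }
        return sum % 256;
    }
}

class FramingStrategyFactory {
    static FramingStrategy forCharset(String charsetName) {
        // A real implementation could return a String-based fast path for
        // ISO-8859-1 (1:1 mapping); this sketch always goes through the bytes.
        return new ByteBasedFramingStrategy(charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        FramingStrategy s = forCharset("ISO-8859-1");
        System.out.println(s.bodyLengthBytes("äbc") + " bytes, checksum " + s.checksum("äbc"));
    }
}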

Comment by Jörg Thönnes [ 12/Apr/07 ]

Today I noticed that outgoing messages with national characters are encoded incorrectly on the first attempt, but are re-encoded correctly on resend. Will investigate further...

Comment by Jörg Thönnes [ 19/Apr/07 ]

OK, for outgoing messages, it currently works as follows:

1. Without PossDup=Y: The checksum is computed on the String directly and then forwarded to the encoder. The checksum is wrong.

2. Inside the resend triggered by a ResendRequest: The String is retrieved from the MessageStore, where it has been saved as a byte[] array. In the process, the "bad" characters seem to be replaced by "?". The checksum computed in the String domain now matches the checksum that would have been computed on the bytes.

In summary, non-ASCII characters cause exactly one resend round-trip for outgoing messages.

Comment by Steve Bate [ 25/May/07 ]

I checked in changes to allow the message encoding to be set on a JVM-wide basis. See the CharsetSupport class and its uses for the specific changes. I apologize that I forgot to add the issue tag to the commit, so the SVN changes aren't linked to this issue.
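For reference, a minimal usage sketch; this assumes the CharsetSupport API as shipped (static setCharset/getCharset, with setCharset rejecting unknown charset names):

import quickfix.CharsetSupport;

public class CharsetConfigDemo {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        // JVM-wide setting; configure it once, before any sessions are created
        CharsetSupport.setCharset("ISO-8859-1");
        System.out.println(CharsetSupport.getCharset());
    }
}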

Comment by amichair [ 09/Jun/14 ]

Following resolution of QFJ-38 and QFJ-382, UTF-8 should now work as well (or any other charset which is backward-compatible with ASCII).
