[QFJ-810] Messages get lost during logon process with sequence gap Created: 06/Oct/14  Updated: 16/Oct/14

Status: Open
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Heribert Steuer Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: condition, gap, logon, race

Issue Links:
Relates
is related to QFJ-804 Race condition between sessionClosed ... Closed

 Description   

When the Logon message carries an out-of-sync sequence number, all subsequent messages sent by the counterparty are lost.

Scenario: Counterparty A initiates a logon; counterparty B responds to the logon and immediately sends 10 more messages. A now detects the sequence problem in the Logon message and closes the session. The 10 messages are lost, meaning they are logged as received but never reach the application callback.

The problem seems to be that the verify() function detects the gap in the Logon message and triggers a logout and a disconnect. As a result, the ThreadPerSessionEventHandlingStrategy.run method exits and its local queue is destroyed while it still contains the messages that were received after the logon.

Whatever the correct behaviour would be, the current one is a problem because the messages are received but never processed. It might be better to leave ThreadPerSessionEventHandlingStrategy.run() only once the queue is empty.
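
To illustrate the suggestion, here is a minimal sketch of a dispatcher loop that exits only after a stop has been requested and the queue has been drained. It is not the QuickFIX/J implementation: DrainingDispatcher, enqueue, requestStop and processed are hypothetical names, and messages are plain strings to keep the example self-contained.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-session dispatcher thread; not QuickFIX/J code.
class DrainingDispatcher implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private volatile boolean stopRequested;

    void enqueue(String message) { queue.add(message); }

    // Called when the session decides to disconnect (e.g. after a bad Logon).
    void requestStop() { stopRequested = true; }

    List<String> processed() { return processed; }

    @Override
    public void run() {
        try {
            // Leave the loop only when a stop was requested AND nothing is queued.
            while (!stopRequested || !queue.isEmpty()) {
                String message = queue.poll(100, TimeUnit.MILLISECONDS);
                if (message != null) {
                    processed.add(message); // placeholder for the real callback
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }
}
```

With this loop condition, messages that were queued before the stop request are still handed to the callback. Whether they should be processed normally or rejected is a separate question.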



 Comments   
Comment by Christoph John [ 08/Oct/14 ]

Just for clarification: the Logon of counterparty B has a sequence number which is too low? In my opinion the messages following the Logon should not be processed since their sequence number was too low. But maybe I did not understand fully.

Comment by Heribert Steuer [ 09/Oct/14 ]

In any case, just dropping messages should never happen. Probably a session-level reject should be sent, because business-level messages appear without the session being completely established. But receiving and simply dropping them does not feel good at all, because neither counterparty gets any feedback on the problem.

Please share your thoughts.

Comment by Christoph John [ 10/Oct/14 ]

I see a problem with sending Reject messages if the Session is not fully established.
Let me ask again: was the sequence number too low? I guess it must have been too low if the connection has been closed immediately by A. If the sequence number is too low it is considered a serious problem and the connection has to be dropped (preferably by sending a Logout message if possible). This would require manual intervention in any case.
Maybe the processing has to be changed in a way that the connection is closed immediately and no more messages are accepted until the session has been completely established.
What do you think?

Comment by Heribert Steuer [ 10/Oct/14 ]

Correct, the SeqNum was too low. In fact we had the issue in production, where a session went out of sync and we got a periodic Logon/Logout war: "A" sends a logon, and "B" accepts the logon and immediately follows it with a resend request. "A" then sends a logout because the sequence number of the logon it received from "B" did not match. The overall problem seems to be that there is no real handshake in FIX, so you never really know when a session is established. In TCP you have SYN, SYN ACK, ACK. In FIX you would, using TCP as an example, have only SYN and SYN ACK. When the responder replies to a logon message, he never knows whether the initiator will accept his logon message. Therefore he does not know when he can start to send data. This is a design flaw in FIX; it would be best to have the initiator return something like "established" to the acceptor after the acceptor's logon.

Nevertheless, the problem remains that the acceptor sends data immediately after sending his logon without knowing whether he is allowed to do so or whether the logon gets rejected somehow. I do not think that closing the connection is a solution. While you are processing the logon in quickfix, Mina would probably already be reading data from the socket, or traffic would be arriving in the network stack's buffer etc., so simply closing the connection would not be a real solution. It always ends up in a race condition. In my opinion, the reject would be okay because from the perspective of the acceptor, the session is established. He received a logon and he replied to it. Therefore quickfix is able to respond to it, and it does so today by sending a logout. A solution might therefore be to check whether the buffer (ThreadPerSessionEventHandlingStrategy) is empty. If so, send a logout (as it does now). If the buffer is not empty, process the messages and send a session-level reject in return for each. Once the buffer is empty, send a logout and close the socket.
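
The idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not QuickFIX/J code: RejectThenLogout and shutDown are hypothetical names, and the outbound messages are represented as plain strings instead of real FIX messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical helper illustrating the proposed shutdown order; not QuickFIX/J code.
class RejectThenLogout {
    /**
     * Drains the buffered inbound messages and returns the outbound actions
     * in the order they would be sent.
     */
    static List<String> shutDown(Queue<String> buffered) {
        List<String> outbound = new ArrayList<>();
        String message;
        // Answer every message that is still buffered with a session-level reject.
        while ((message = buffered.poll()) != null) {
            outbound.add("Reject(" + message + ")");
        }
        // Buffer is empty now: log out and close the socket, as happens today.
        outbound.add("Logout");
        outbound.add("Disconnect");
        return outbound;
    }
}
```

With an empty buffer this yields only Logout and Disconnect, which matches the current behaviour. With buffered messages, each one gets a reject first, so the counterparty receives feedback instead of silence.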

Unfortunately I do not really have a better idea of how to solve it; as mentioned earlier, it's more or less a flaw in the design of the protocol. How does the above idea sound to you?

Comment by Christoph John [ 11/Oct/14 ]

Just another question: this issue basically sounds a little like QFJ-790. Do you agree?

However, your solution sounds feasible. Will take a deeper look at it.

As a workaround for the time being, many people implement the handshake as follows: after Logon, immediately send a TestRequest. As soon as you receive the Heartbeat message with your TestReqID you can be pretty sure that the session is established. Of course, this is no 100% guarantee but works quite well in most cases.
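
A minimal sketch of the bookkeeping behind this workaround, assuming the application generates its own TestReqID (tag 112) and inspects inbound Heartbeat messages itself. HandshakeGate and its methods are hypothetical names, not part of QuickFIX/J. In a QuickFIX/J application this would presumably be wired into the onLogon and fromAdmin callbacks, but that wiring is left out here.

```java
import java.util.UUID;

// Hypothetical helper for the TestRequest/Heartbeat handshake; not QuickFIX/J code.
class HandshakeGate {
    private final String expectedTestReqId = UUID.randomUUID().toString();
    private volatile boolean established;

    /** Value to put into TestReqID (tag 112) of the TestRequest sent right after Logon. */
    String testReqId() {
        return expectedTestReqId;
    }

    /** Call for every inbound Heartbeat; pass null if it carries no TestReqID. */
    void onHeartbeat(String testReqId) {
        if (expectedTestReqId.equals(testReqId)) {
            established = true; // counterparty answered our TestRequest
        }
    }

    /** Business messages should only be sent once this returns true. */
    boolean isEstablished() {
        return established;
    }
}
```

Heartbeats without a TestReqID, or with a different one, leave the gate closed, so ordinary periodic heartbeats are not mistaken for the handshake reply.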
