[#QFJ-999] Issue with infinite replay request

[QFJ-999] Issue with infinite replay request Created: 03/Feb/21 Updated: 07/Feb/21 Resolved: 07/Feb/21
Status:	Closed
Project:	QuickFIX/J
Component/s:	None
Affects Version/s:	2.1.0
Fix Version/s:	None

Type:

Bug

Priority:

Default

Reporter:

Dan Marques

Assignee:

Christoph John

Resolution:

Not a bug

Votes:

Labels:

Reconnect, resend

Environment:

Java 8 / Linux

Attachments:

issue-with-gw-restart-test.txt

Description

I believe I've seen evidence of some sort of issue with infinite replay requests.

I've attached an the annotated log of the situation. The log is from "EXCHANGE", which is a QFJ 2.1.0 application; the other party of the session, "BROKER_GATEWAY", is not QFJ. Both applications are on the same host, so the message time stamps correlate.

Essentially, what occurs in this scenario is that there is a disconnect - and while the FIX session is disconnected, EXCHANGE generates 3 execution reports (er...10, er...11, and er...12). When the session reconnects, EXCHANGE sends a logon with a sqn 12 that is greater than what BROKER_GATEWAY was expected (it was expecting sqn 11), and follows with sqn 13 (er...11) and sqn 14 (er...12)

While EXCHANGE is sending those "queued" messages, BROKER_GATEWAY issues a resend-request, for messages starting at sqn 11 and with an infinite length.

EXCHANGE responds by replaying sqns 11 (er...10), 12 (admin msg as a gap-fill), and 13 (er...11), but never replays 14 (er...12) It isn't until 30 seconds later, that another message, a heartbeat, from EXCHANGE with sqn 15 triggers a second replay request, and 14 (er...12) is replayed.

It seems as if there is some sort of race, where the resend request arrives after sqn 14 was "triggered" to be sent, but before the internal state used to prepare a resent request has been updated.

Comments

Comment by Christoph John [ 05/Feb/21 ]

Thanks for the detailed and annotated log. I will try to create a unit test with this.

Comment by Christoph John [ 05/Feb/21 ]

Please correct me if I get it wrong but here is my interpretation of the log. To me everything looks correct from QFJ perspective.

EXCHANGE gets the ResendRequest after sending out seqno 13, so seqno 14 is a message not affected by resend processing. What BROKER_EXCHANGE side should do is queue any messages that they cannot yet process due to seqno-too-high issues. After the resend processing is complete, the other side should process the queued messages.

In my opinion only report with seqno 11 from the EXCHANGE was generated during disconnection. If seqno 13 and 14 were also resent from the message store (where they reside when there is a disconnection) they would get sent with 43/PossDup and 122/OrigSendingTime as a result from a ResendRequest.
Moreover, you write that EXCHANGE has seqno 10 before the disconnection and sends 12 on Logon. So only seqno 11 could have been created during disconnection.
Could it be that the reports that originated from seqno 13 and 14 were created shortly after or at the moment the FIX session came back up? They have no TransactTime, so I cannot tell.

What do you think?

Comment by Dan Marques [ 05/Feb/21 ]

I think you are correct about the timing of 13 and 14 (being generated after the reconnect), but I don't think the log shows QFJ behaving correctly.

It seems to me that the correct behavior for QFJ, upon receiving a resend request from 11 to infinity, would be to satisfy that request before sending any new messages. So, in our example, if QFJ got the request before sending 14, it should replay 11, 12 and 13, and then send 14. If on the other hand, it got the request after sending 14, it should have replayed 11, 12, 13, and 14.

What we see instead, is it sending 14, following by a replay of 11, 12, and 13. To me, this looks like some sort of race condition, where 14 is "being sent' right as the request comes in - and there is race between the sending code updating the "highest-sqn-counter", and the replay response code reading that value.

EXCHANGE gets the ResendRequest after sending out seqno 13, so seqno 14 is a message not affected by resend processing. What BROKER_EXCHANGE side should do is queue any messages that they cannot yet process due to seqno-too-high issues. After the resend processing is complete, the other side should process the queued messages.

While queuing 14 would "hide" this bug, it seems like QFJ is imposing behavior on the receiving application which isn't mandated by the FIX 4.2 spec. In fact, it seem to run contrary to the following (FIX 4.2 with Errata 20010501, page 30) :

Note: it is imperative that the receiving application process messages in sequence order, e.g. if
message number 7 is missed and 8-9 received, the application should ignore 8 and 9 and ask for a
resend of 7-9, or, preferably, 7-0 (0 represents infinity). This latter approach is strongly recommended
to recover from out of sequence conditions as it allows for faster recovery in the presence of certain
race conditions when both sides are simultaneously attempting to recover a gap.

The quote above says nothing about queuing 8 and 9, and instead says that those messages should be "ignored" and either explicitly (7 - 9) or implicitly (7 - 0) requested for a resend. As the receiver is going to process messages one at a time, when it determines 7 is missing (upon receiving 8), it will not have any knowledge of 9 - it doesn't know if 9 has already been read from the (local) socket, on the wire, in the sender's message store, or yet to be generated by the sender application.

This is completely analogous to our example: 11 is missing, and 12-14 are received. The receiver only knows that 11 is missing when it receives 12, it knows nothing yet about 13 and 14. As soon as that gap is detected, it asks for a resend, from 12 - infinity, as per the recommended approach.

From you description (i.e. if there isn't a bug), it seems that QFJ treats 13 differently from 14, it resends 13 but not 14, based on when it gets the resend-request. However, it impossible for the receiver to know that QFJ is going to treat those differently - when it requested the resend (after receiving 12) it there is no way for it to know that (from QFJ's point of view) 13 had been sent but 14 hadn't. For the receiver to correctly handle QFJs behavior it seems that it would need to queue both 13 and 14. That seems clearly outside the behavior recommended by the spec.

Comment by Christoph John [ 06/Feb/21 ]

Let me first comment on your statements:

It seems to me that the correct behavior for QFJ, upon receiving a resend request from 11 to infinity, would be to satisfy that request before sending any new messages.

Why should QFJ do that? On large message backlogs this could lead to QFJ only satisfying the ResendRequest and queueing all current messages. Instead, new messages could be interleaved with old ones. Of course, the receiving side would need to queue the newer messages until the gap is filled. But this is common practice (see further below in my comment).

So, in our example, if QFJ got the request before sending 14, it should replay 11, 12 and 13, and then send 14. If on the other hand, it got the request after sending 14, it should have replayed 11, 12, 13, and 14.

This will not work in situations when new messages are generated "at the same time" and concurrent ResendRequest processing is going on.
Upon reception of a ResendRequest, QFJ will simply determine the current outgoing sequence number and resend up to that message. You can imagine that the current outgoing sequence number is changing rapidly when sending out multiple messages after each other. There is no need to impose any kind of locking on sending out messages just because a ResendRequest is received.

Now to the specification part:
Firstly, In my opinion the quote from the spec "if message number 7 is missed and 8-9 received, the application should ignore 8 and 9" one should read "ignore" as "not process at the moment".

And although we are talking about a FIX4.2 connection, the session-level specification has since been refined and clarified. That is why you should always check up session-level stuff in the most current spec. However, even the relevant part I mentioned is in the FIX4.2 spec. Please see page 18 (or search for "queue" )

It is also recommended that an engine should store out of sequence messages in a temporary queue and process them in order when the gap is closed. This prevents generating resend requests for n->m, n->m+1, n->m+2, ... which can result in many resent PossDupFlag=Y messages.

By the way, there also is a FIX session state matrix which lists various states of the connection. However, it is not contained in the FIX4.2 spec. IIRC it was in the FIX4.4 spec. Or even better, you can check it online here: https://www.fixtrading.org/standards/fix-session-layer-online/#fix-session-state-matrix

Relevant parts are 11 and 12:

11 Receive MsgSeqNum(34) Too High -
Receive too high of MsgSeqNum(34) from counterparty, queue message, and send ResendRequest(35=2).

12 Awaiting/ Processing Response to ResendRequest(35=2) -Process requested MsgSeqNum(34) with PossDupFlag(43)=Y resent messages and/or SequenceReset(35=4) gap fill messages from counterparty. Queue incoming messages with MsgSeqNum(34) too high.

conclusion
I think the other FIX engine should be corrected to handle this scenario. However, it seems in the worst case it will resynchronize after the next received message. So it might not be worth the trouble (depending on the effort needed). Are you able/allowed to tell which FIX engine this is? (would be handy for my internal list of specialties of FIX engines when analyzing problems with counterparties)

By the way, you might want to look at using NextExpectedMsgSeqNum on Logon (if your counterparty supports it), since this will eliminate the additional ResendRequest roundtrip on Logon and hence also the "racy" conditions. However, as I just realized it is only available from FIX4.4 and up.

If you still have questions please feel free to ask.

Cheers,
Chris.

Comment by Dan Marques [ 07/Feb/21 ]

Chris,

Thanks for your response.

As to your point:

In my opinion the quote from the spec "if message number 7 is missed and 8-9 received, the application should ignore 8 and 9" one should read "ignore" as "not process at the moment".

If your interpretation was what the spec's authors intended, I don't see why they would explicitly state that the receiver should "ask for a resend of 7-9". Under your interpretation, it would need a resend of 7 only, as it could then consume 8 and 9 from the queue.

As to the portion of the spec you referenced:

It is also recommended that an engine should store out of sequence messages in a temporary queue and process them in order when the gap is closed.

That behavior is explicitly described as "recommended", as opposed to "required" or "mandatory" or some similar language. Therefore an application that does not store those message can still be correct.

That said, the second sentence of that quote is illustrative.

This prevents generating resend requests for n->m, n->m+1, n->m+2, ... which can result in many resent PossDupFlag=Y messages.

What I now think is, if QFJ's behavior is correct with regard to the spec, a connecting application which chose not to store out of sequence messages would need to issue multiple, overlapping, resend requests, until it was caught up. For example, in my original log, upon receiving sqn 14, the receiver application should have issued a second request for 11 - 0. As that would have arrived at the QFJ sender after 14 was sent, it would have initiated a second replay, this time from 11- 14. As per that quote, messages 11, 12 and 13 would have been resent twice - safe, but inefficient.

Again, thanks for your time.

Dan

Comment by Christoph John [ 07/Feb/21 ]

Hi Dan,

the reason why a ResendRequest for a single message should be avoided is explained in the next sentence in the spec:

This latter approach is strongly recommended to recover from out of sequence conditions as it allows for faster recovery in the presence of certain race conditions when both sides are simultaneously attempting to recover a gap.

So yes, in this part the spec does not explicitly state that the message should be queued but rather focuses on how to avoid race conditions between two parties which both are doing gap fills.

You are right that the other FIX engine still works in this case. As I said the other FIX engine should be corrected and not must be corrected. And as I also said the situation corrects itself after the next message.

Do you think that this issue can be closed?

Thanks,
Chris.

Comment by Dan Marques [ 07/Feb/21 ]

Chris,

You can close it.

Thanks.

Dan

Generated at Fri May 10 16:41:37 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.

[QFJ-999] Issue with infinite replay request Created: 03/Feb/21 Updated: 07/Feb/21 Resolved: 07/Feb/21