[#QFJ-262] Quickfix resend problems after recovering from network failure

[QFJ-262] Quickfix resend problems after recovering from network failure Created: 30/Oct/07 Updated: 15/Jan/08 Resolved: 12/Jan/08
Status:	Closed
Project:	QuickFIX/J
Component/s:	None
Affects Version/s:	1.2.1, 1.3.0
Fix Version/s:	1.3.1

Type:

Bug

Priority:

Critical

Reporter:

Tim Wright

Assignee:

Unassigned

Resolution:

Fixed

Votes:

Labels:

None

Environment:

Linux Centos 4 \ Sun JDK 1.6_02

Attachments:

FIX Issue.txt

QuickFixLog.txt

Description

We have seen an incident occur twice in our quickfix code where the fix engine seems to send spurious resend requests, and processes FIX messages incorrectly - including handing duplicate messages to the application.

On both occasions, this has happened following a network failure where the FIX engine seemed to behave correctly, recovering missed messages, and correctly handling the re-establishment of the proper sequence numbers. Following (what we assume) is the triggering incident things start to seem to go wrong.

I have attached a file with the full details from out log files which a walk through commentary

Comments

Comment by Tim Wright [ 30/Oct/07 ]

This document contains the log files and full details of the issue I have seen.

Comment by Michael Briganti [ 17/Nov/07 ]

We have experienced the same issue while running with 1.3.0. We have not been able to say for certain whether our network experienced an "interruption" but on at lease on distinct occasion we observed the same "runaway" and unrequited resend request. Any help on this issue, or work-around, would be very useful.

Comment by Michael Briganti [ 21/Nov/07 ]

We attempted to investigate further, but unfortunately we cannot provide a log with any conclusive information. What we can say is that we believe there is a network level issue with the mina jar's found in 1.3.0. After migrating to 1.3.0 our application began dropping messages causing the fix engine to send resend requests. We found this behavior to be intermittent, sporadic, and only manifesting during very high load conditions (algorithmic based trading). Prior to "upgrading" to 1.3.0 our application had run for over six months with quickfix 12.1 without ever "glitching." Since this is a production issue for us, and since we were not successful duplicating the issue in our QA environment our solution was to roll-back to quickfix 1.2.1, afterwhich everything became calm again.

We are submitting only to provide a record should others experience unexplained issues with "dirty sockets."

Comment by Steve Bate [ 21/Nov/07 ]

Thanks for the additional information. Is there something specific making you think it is MINA versus a threading issue?

Comment by Michael Briganti [ 21/Nov/07 ]

Our apologies. We hope we aren't sounding too alarmist by assigning the issue to MINA, and in truth, even if MINA is the culprit, it is most likely a threading issue anyway--albeit for MINA to solve.

Our assessment is based on indirect observation which admittedly, must be held circumspect. What we observed is that we can most definitely confirm from logs that our sending engine (in4Reach) sent sequence numbers 10, 11, 12 to the QuickFix 1.3.0 engine and we received 10, and 12. The engine rightfully requests a resend, but what caused the missed message?

Network disconnect on either side is ruled out-~~both QuickFix 1.3.0 (MINA) and the connecting engine (in4Reach) would have registered a network exception, as in the case of this current issue (~~QFJ-262~~)~~-the reason for the initial thread.

Also, both the connecting engine (in4Reach) and the QuickFix 1.3.1 engine is running internally, on our LAN, indeed on the same subnet. Network topology, and message loss is always a possibility, but not likely since QuickFix 1.2.1 ran for months without ever a single resend request, under similar loads and under the same network topology no less.

Based on our limited analysis, the only issues we have left to investigate is the differences between the QuickFix versions (1.3.1 vs. 1.2.1). When we compared the Session.java modules from the two versions we couldn't be certain, but it appeared that most of the changes are of the "plain vanilla type" and were initiated in order to move from jdk1.4 to jdk1.5 although of course, you would be in a better position then we are to know what exactly is going on in that module. Furthermore, if the issue was at the QuickFix level, we would expect the MINA modules to have logged the incoming messages and for the messages to make it through the MessageDecoders--which to the best that we can tell, is not the case.

We've used MINA for other purposes, and while our experiences have been positive, we have always been concerned about stability of new releases, but this may be a simple prejudice rather then a something to be considered a hard fact.

A little more about our installations. You already know the network topology. From the hardware perspective, we are running on MS Window 2003 multi-processor (2 dual core) server machines. On the QuickFix side we are using the SleepycatStore which initially we had some concerns about because there was some synchronization in the SleepycatStore module that we didn't quite understand. We hypothesized that the SleepycatStore module may have no extensive field experience to speak of and that this module could be mucking up sequence numbers, but again, the QuickFix based MINA modules would have logged the incoming messages, after which things would have gotten a little chaotic within the engine itself.

That is all we have to offer. I wish our report could provide more conclusive evidence with regard to the issues genesis. Of course, if messages are not being reliably delivered, we can expect that this thread would start to load up pretty quickly.

Comment by Tim Wright [ 22/Nov/07 ]

Hi all,

Firstly, I do have to apologize for throwing in a bit of a red herring in. I double checked the versions of quickfix we were running when the incidents occurred and it was 1.2.1 and not 1.3.0. It seems looking for the issue we have experience will not be found in a code change between 1.2.1 and 1.3.0.

We have only ever seen the issue following an "ungraceful fail over" of a switch connected to our firewall (thru which the FIX connection passes).To date, we have not done much investigation into the code since a restart the process is quite painless, the issue is rare and the problematic switch has now been fixed!

Michael, it seems possible you have a different or related issue?

It's doe seems likely that the issue we experience were triggered by a problem in Mina. In on example quickfix reports that a messages was dropped and starts the resend process even though the message is clearly output in the log. It's seems possible that problems in Mina could cause this - but not sure how.

When we get some time, we will try and reproduce the problem and do a little more investigation.

Tim

Comment by Michael Briganti [ 23/Nov/07 ]

We are attaching a file of a recent session running with QuickFix 1.3.0. Our installation receives a message type 35=Q, DK message with sequence number 1285. The message is received and logged to the JDBC logger so we know this message made it to our quickfix 1.3.0 installation. The engine then complains that it hasn't received 1285, apparently claiming it is receiving 1286, even though 1286 does not appear until quite a few milliseconds later (and in the form of a 35=4 the sequence reset protocal). The engine requests a resend, and apparently gets the resend, and we see that no matter what happens, the 1285 message does not get satisfied.

Comment by Michael Briganti [ 23/Nov/07 ]

Incidentally, the last post, and attached logging information, plus some additional investigations we did with the MINA community seems to rule-out MINA as the culprit, at least for the issues we are experiencing.

Comment by Steve Bate [ 24/Nov/07 ]

The log file definitely indicates some strange behavior. My current theory is that it's some type of thread-related issue. The resend processing hasn't changed in quite some time. Just for additional information, what message store and session connector implementations are you using? Have you created custom implementations or modified the QFJ bundled implementations? Again, I'm not saying you're doing anything to cause this problem but I'm trying to eliminate that possibility.

Comment by Michael Briganti [ 24/Nov/07 ]

We are running quite a few QuickFix installations around here--to you and your team credit. Of course, when a "problem" was uncovered in one installation, our development team became hyper-vigilent and that is how the second problem was uncovered. For the particular implementation that coincides with our log attachment, we are using the standard ThreadedSocketAcceptor/Initiators that comes "out of the box."

I'm glad you asked about the message store. Originally I was under the impression that we were using the supplied JdbcStore, but after closer inspection for this implementation we are using a slightly "tweaked" JdbcStore, and for the first implementation, the implementation where the original problem was observed, we are using the SleepycatStore.

In closer inspection of the "tweaked" version of the JdbcStore, I can see we have slightly modified the reset() method, and this very well could be the reason for the issue and the strange behavior exhibited in the log file we supplied. Our reason for modifying reset() escapes me at the moment. For this installation we are planning to revert to the standard JdbcStore and to also instrument that installation with SLF4J engine level logging. That installation is a large one (for us) with over 30 separate connections, some initiators and some acceptors. Generally it behaves very well and should a problem develop, we will have increased visibility within the logs and better assurances that we are using a out-of-box configuration.

Going forward, we still have the SleepycatStore installation to contend with. This installation is used in Algorithmic trading and must contend with large burst of many orders that happen in very quick succession. Like the other installation, this component is using standard ThreadedSocketAcceptor/Initiators and is currently fully instrumented for engine level logging through SLF4J. This installation works fine with 1.2.1 but begins to fritz with the upgrade to 1.3.0. Originally we suspected MINA, but after having discussions on the MINA boards, and upon closer inspections we are backing away from that assessment.

Kindly weigh in on your assessment of the SleepycatStore. We are considering sacrificing some through-put and convert to the JdbcStore for the aforementioned installation since we are concerned that the SleepycatStore may have little or no experience in actual production situations.

In reviewing our contributions to this thread, I am realizing that we have supplied you with a lot of words but no conclusive material that can assist in saying for certain if there is a problem in 1.3.0 or not. We will continue to monitor our installations and will provide additional information as it becomes available to us.

In the meantime thanks for your timely support.

Comment by Steve Bate [ 12/Jan/08 ]

I think this may have been an issue with the way the session state resend queue was being managed. I have made some modifications and done some load testing with lots of corrupted messages, random disconnects, and so on and everything appears to be working well. If you continue to have these problems after 1.3.1, then please reopen the issue.

Generated at Wed May 15 20:41:04 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.

[QFJ-262] Quickfix resend problems after recovering from network failure Created: 30/Oct/07 Updated: 15/Jan/08 Resolved: 12/Jan/08