We have seen an incident occur twice in our QuickFIX/J code where the FIX engine seems to send spurious resend requests and processes FIX messages incorrectly, including handing duplicate messages to the application. On both occasions, this has happened following a network failure where the FIX engine seemed to behave correctly, recovering missed messages and correctly handling the re-establishment of the proper sequence numbers. Following what we assume is the triggering incident, things start to go wrong.

The following is a description of what happened, with extracts from our log files. I have used our post-processing utility, which pretty-prints the log lines from the "SocketConnectorIoProcessor" (i.e. as FIX messages are plucked from the wire).

First, we see the "original incident" (which was the failure of a switch following a firewall failover):

>> Sent Heartbeat [0] ASPECT >>> VWXYZ @ 20071030-10:12:02.042 (Seq: 1788)
>>
>> Recieved Test Request [1] VWXYZ >>> ASPECT @ 20071030-10:12:03 (Seq: 2230)
>> TestReqID [112] : HeartBtExt Timeout
>>

From this point our messages to the counterparty are lost...

>> Sent Heartbeat [0] ASPECT >>> VWXYZ @ 20071030-10:12:06.209 (Seq: 1789)
>> TestReqID [112] : HeartBtExt Timeout
>>
>> Recieved Heartbeat [0] VWXYZ >>> ASPECT @ 20071030-10:12:33 (Seq: 2231)
>>
>> [INFO] [2007-10-30 10:12:35,069] [SocketConnectorIoProcessor-0.0] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Disconnecting
>> [ERROR] [2007-10-30 10:12:35,070] [SocketConnectorIoProcessor-0.0] [raiser.SafeExceptionRaiser] QUICK FIX ADAPTER DOWN
>>
>>
>> [INFO] [2007-10-30 10:13:35,386] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: null

The QuickFIX/J adapter connection is down, and it tries to reconnect....
>> [INFO] [2007-10-30 10:14:36,390] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: null
>> [INFO] [2007-10-30 10:15:37,396] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: null
>> [INFO] [2007-10-30 10:16:38,400] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: null
>> [INFO] [2007-10-30 10:17:24,409] [SocketConnectorIoProcessor-0.0] [initiator.InitiatorIoHandler] MINA session created: /10.0.0.89:34065
>> [ERROR] [2007-10-30 10:17:24,409] [QFJ Timer] [connection.QuickfixPhysicalInterface] Session IDs don't match. FIX.4.2:ASPECT->VWXYZ and null
>> Sent Logon [A] ASPECT >>> VWXYZ @ 20071030-10:17:24.409 (Seq: 1790)
>> HeartBtInt [108] : 30
>>
>> [INFO] [2007-10-30 10:17:24,410] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Initiated logon request
>> [INFO] [2007-10-30 10:17:45,042] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Timed out waiting for logon response
>> [INFO] [2007-10-30 10:17:45,043] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Disconnecting
>> [WARN] [2007-10-30 10:17:45,043] [QFJ Timer] [connection.QuickfixPhysicalInterface] FIX logout for unknown session quickfix_vwxyz session VWXYZ
>> [INFO] [2007-10-30 10:18:45,456] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: null

The second try works, and we determine that a resend is required since 2232 was missed...

>> [INFO] [2007-10-30 10:19:07,467] [SocketConnectorIoProcessor-0.0] [initiator.InitiatorIoHandler] MINA session created: /10.0.0.89:34071
>> [ERROR] [2007-10-30 10:19:07,467] [QFJ Timer] [connection.QuickfixPhysicalInterface] Session IDs don't match. FIX.4.2:ASPECT->VWXYZ and null
>> Sent Logon [A] ASPECT >>> VWXYZ @ 20071030-10:19:07.467 (Seq: 1791)
>> HeartBtInt [108] : 30
>>
>>
>> [INFO] [2007-10-30 10:19:07,468] [QFJ Timer] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Initiated logon request
>> Recieved Logon [A] VWXYZ >>> ASPECT @ 20071030-10:19:07 (Seq: 2233)
>> HeartBtInt [108] : 30
>>
>> [INFO] [2007-10-30 10:19:07,475] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Received logon response
>> [INFO] [2007-10-30 10:19:07,475] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: MsgSeqNum too high, expecting 2232 but received 2233
>> [ERROR] [2007-10-30 10:19:07,477] [QFJ Message Processor] [connection.QuickfixPhysicalInterface] Session IDs don't match. FIX.4.2:ASPECT->VWXYZ and null
>> Sent Resend Request [2] ASPECT >>> VWXYZ @ 20071030-10:19:07.477 (Seq: 1792)
>> BeginSeqNo [7] : 2232
>> EndSeqNo [16] : 0
>>
>> [INFO] [2007-10-30 10:19:07,477] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Sent ResendRequest FROM: 2232 TO: 0
>> [INFO] [2007-10-30 10:19:07,477] [QFJ Message Processor] [connection.QuickfixPhysicalInterface] FIX logon for quickfix_vwxyz session VWXYZ
>> [INFO] [2007-10-30 10:19:07,478] [QFJ Message Processor] [connection.QuickfixPhysicalInterface] FIX logon for VWXYZ being used as session to send to
>>
>>
>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:19:07 (Seq: 2232)
>> ExecType [150] : Fill (Replaced) [2]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Fill (Replaced) [2]
>> ClOrdID [11] : f8di214l:0.0
>> OrderID [37] : 20071030004107
>> Symbol [55] : GE
>> Side [54] : Buy [1]
>> OrderQty [38] : 15
>> Price [44] : 9575
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 15
>> LastPx [31] : 9575
>> ExecID [17] : 37012822
>> MaturityMonthYear [200] : 200806
>> Account [1] : 0000AASP01
>> CumQty [14] : 15
>> LeavesQty [151] : 0
>> OrigSendingTime [122] : 20071030-10:17:45
>> SecurityType [167] : FUT
>> AvgPx [6] : 9575.0000000
>> TransactTime [60] : 20071030-10:17:45
>> PossDupFlag [43] : Y
>>
>> [INFO] [2007-10-30 10:19:07,483] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: ResendRequest for messages FROM: 2232 TO: 2232 has been satisfied.
>> [INFO] [2007-10-30 10:19:07,483] [QFJ Message Processor] [impl.QuickfixPhysicalConnectionHolder] Incoming message received
>> [INFO] [2007-10-30 10:19:07,483] [QFJ Message Processor] [workunit.SimpleWorkUnitScheduler] Queuing FIX message for processing
>> [INFO] [2007-10-30 10:19:07,483] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Processing QUEUED message: 2233
>> [INFO] [2007-10-30 10:19:07,484] [Executor] [workunit.SimpleWorkUnitScheduler] Starting work unit: 'InboundRequestWorkUnit'. Txn: 'system-txn-1193693129619'. Executor queue size: 0
>> Recieved Sequence Reset [4] VWXYZ >>> ASPECT @ 20071030-10:19:07 (Seq: 2233)
>> NewSeqNo [36] : 2234
>> GapFillFlag [123] : Y
>> PossDupFlag [43] : Y
>>
>>
>> [INFO] [2007-10-30 10:19:07,484] [Executor] [inbound.ExecutionReportWorkUnit] Processing FILL execution report...

(I removed some application logging here.)

The other side of the connection now requests the messages that we failed to send. There was nothing of interest in them, so a gap fill is sent....

>> Recieved Test Request [1] VWXYZ >>> ASPECT @ 20071030-10:19:07 (Seq: 2234)
>> TestReqID [112] : synchronized?
>>
>> Sent Heartbeat [0] ASPECT >>> VWXYZ @ 20071030-10:19:07.527 (Seq: 1793)
>> TestReqID [112] : synchronized?
>>
>> Recieved Resend Request [2] VWXYZ >>> ASPECT @ 20071030-10:19:07 (Seq: 2235)
>> BeginSeqNo [7] : 1788
>> EndSeqNo [16] : 0
>>
>> [INFO] [2007-10-30 10:19:07,528] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Received ResendRequest FROM: 1788 TO: 0
>> Sent Sequence Reset [4] ASPECT >>> VWXYZ @ 20071030-10:19:07.529 (Seq: 1788)
>> NewSeqNo [36] : 1794
>> GapFillFlag [123] : Y
>> OrigSendingTime [122] : 20071030-10:19:07
>> PossDupFlag [43] : Y
>>
>> [INFO] [2007-10-30 10:19:07,529] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Sent SequenceReset TO: 1794
>> Recieved Resend Request [2] VWXYZ >>> ASPECT @ 20071030-10:19:07 (Seq: 2236)
>> BeginSeqNo [7] : 1788
>> EndSeqNo [16] : 0
>>
>> [INFO] [2007-10-30 10:19:07,533] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Received ResendRequest FROM: 1788 TO: 0
>> Sent Sequence Reset [4] ASPECT >>> VWXYZ @ 20071030-10:19:07.533 (Seq: 1788)
>> NewSeqNo [36] : 1794
>> GapFillFlag [123] : Y
>> OrigSendingTime [122] : 20071030-10:19:07
>> PossDupFlag [43] : Y
>>
>> [INFO] [2007-10-30 10:19:07,533] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Sent SequenceReset TO: 1794

At this point all seems OK. Heartbeats are ticking away and things spring back to life again...

>> Recieved Heartbeat [0] VWXYZ >>> ASPECT @ 20071030-10:19:37 (Seq: 2237)
>>
>> Sent Heartbeat [0] ASPECT >>> VWXYZ @ 20071030-10:19:37.537 (Seq: 1794)
>>
>> Recieved Heartbeat [0] VWXYZ >>> ASPECT @ 20071030-10:20:07 (Seq: 2238)

All goes OK for around 15 minutes, until we get another "MsgSeqNum too high" message.
However, this time the message that it claimed to have missed (2289) was clearly visible in the log:

>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:34:50 (Seq: 2289)
>> ExecType [150] : Partial fill (Replaced) [1]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Partial fill (Replaced) [1]
>> ClOrdID [11] : f8di2160:0.0
>> OrderID [37] : 20071030003319
>> Symbol [55] : GE
>> Side [54] : Buy [1]
>> OrderQty [38] : 16
>> Price [44] : 9560.5
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 2
>> LastPx [31] : 9560.5
>> ExecID [17] : 37012881
>> MaturityMonthYear [200] : 200803
>> Account [1] : 0000AASP01
>> CumQty [14] : 9
>> LeavesQty [151] : 7
>> SecurityType [167] : FUT
>> AvgPx [6] : 9560.5000000
>> TransactTime [60] : 20071030-10:34:50
>>
>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:34:50 (Seq: 2290)
>> ExecType [150] : Partial fill (Replaced) [1]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Partial fill (Replaced) [1]
>> ClOrdID [11] : f8di2160:0.0
>> OrderID [37] : 20071030003319
>> Symbol [55] : GE
>> Side [54] : Buy [1]
>> OrderQty [38] : 16
>> Price [44] : 9560.5
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 2
>> LastPx [31] : 9560.5
>> ExecID [17] : 37012882
>> MaturityMonthYear [200] : 200803
>> Account [1] : 0000AASP01
>> CumQty [14] : 11
>> LeavesQty [151] : 5
>> SecurityType [167] : FUT
>> AvgPx [6] : 9560.5000000
>> TransactTime [60] : 20071030-10:34:50
>>
>> [INFO] [2007-10-30 10:34:50,383] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: MsgSeqNum too high, expecting 2289 but received 2290
>> [INFO] [2007-10-30 10:34:50,383] [QFJ Message Processor] [impl.QuickfixPhysicalConnectionHolder] Incoming message received
>> Sent Resend Request [2] ASPECT >>> VWXYZ @ 20071030-10:34:50.383 (Seq: 1833)
>> BeginSeqNo [7] : 2289
>> EndSeqNo [16] : 0

A resend request has been sent for 2289 even though that message has already been seen.
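To make clear why this looks wrong to us, here is a rough sketch of the inbound sequence check as we understand the FIX session rules. It is our own simplified model, not QuickFIX/J's code: the class `InboundSeqSketch`, its method names, and the event strings are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Simplified, hypothetical model of inbound MsgSeqNum checking. Not QuickFIX/J code. */
class InboundSeqSketch {
    private int expected;            // next inbound MsgSeqNum (tag 34) we expect
    private boolean resendPending;   // true while a ResendRequest is outstanding
    private final TreeSet<Integer> queued = new TreeSet<>(); // sequence numbers held back
    final List<String> events = new ArrayList<>();           // what the model decided to do

    InboundSeqSketch(int firstExpected) {
        this.expected = firstExpected;
    }

    /** Applies the usual gap rules to one inbound message. */
    void onMessage(int seq, boolean possDup) {
        if (seq > expected) {
            // Gap: hold the message back and ask for everything from 'expected' onwards.
            queued.add(seq);
            if (!resendPending) {
                resendPending = true;
                events.add("resend " + expected + "-0");
            }
        } else if (seq < expected) {
            // Already seen: a PossDup copy is dropped, anything else is a session error.
            events.add((possDup ? "drop " : "error ") + seq);
        } else {
            deliver(seq);
            // Release any queued messages that are now in sequence.
            while (queued.remove(expected)) {
                deliver(expected);
            }
            if (queued.isEmpty()) {
                resendPending = false;
            }
        }
    }

    private void deliver(int seq) {
        events.add("deliver " + seq);
        expected = seq + 1;
    }

    public static void main(String[] args) {
        // The 10:34:50 sequence as it appears on the wire: 2289, 2290, then a PossDup 2289.
        InboundSeqSketch session = new InboundSeqSketch(2289);
        session.onMessage(2289, false);
        session.onMessage(2290, false);
        session.onMessage(2289, true);
        System.out.println(session.events); // [deliver 2289, deliver 2290, drop 2289]
    }
}
```

Under this model, 2289 followed by 2290 never produces a resend request. A gap is only raised when 2290 arrives while 2289 is still the expected number, which is what the engine reported even though 2289 is in the wire log.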
At this point, things start to get even weirder. Firstly, QuickFIX/J seems to report that the resend request has been satisfied, even though nothing has been received:

>> [INFO] [2007-10-30 10:34:50,384] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Processing QUEUED message: 2290
>> [INFO] [2007-10-30 10:34:50,384] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Sent ResendRequest FROM: 2289 TO: 0
>> [INFO] [2007-10-30 10:34:50,384] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: ResendRequest for messages FROM: 2289 TO: 2289 has been satisfied.

Secondly, our application logic logs that it has processed a fill:

>> [INFO] [2007-10-30 10:34:50,384] [Executor] [inbound.InboundRequestDispatcherImpl] Sending execution: Execution: f8di2160:0 EID: 37012881 2@95.605 CQ:9 C:VWXYZ

The CQ: value in the log line indicates that this is the first fill, with a cumulative quantity of 9 (the logic would have raised an exception if this were not the case). The second fill (2290) is then processed by the application too.

Finally, we get the response from the counterparty. The resent messages seem to be ignored on this occasion....
>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:34:50 (Seq: 2289)
>> ExecType [150] : Partial fill (Replaced) [1]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Partial fill (Replaced) [1]
>> ClOrdID [11] : f8di2160:0.0
>> OrderID [37] : 20071030003319
>> Symbol [55] : GE
>> Side [54] : Buy [1]
>> OrderQty [38] : 16
>> Price [44] : 9560.5
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 2
>> LastPx [31] : 9560.5
>> ExecID [17] : 37012881
>> MaturityMonthYear [200] : 200803
>> Account [1] : 0000AASP01
>> CumQty [14] : 9
>> LeavesQty [151] : 7
>> OrigSendingTime [122] : 20071030-10:34:50
>> SecurityType [167] : FUT
>> AvgPx [6] : 9560.5000000
>> TransactTime [60] : 20071030-10:34:50
>> PossDupFlag [43] : Y
>>
>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:34:50 (Seq: 2290)
>> ExecType [150] : Partial fill (Replaced) [1]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Partial fill (Replaced) [1]
>> ClOrdID [11] : f8di2160:0.0
>> OrderID [37] : 20071030003319
>> Symbol [55] : GE
>> Side [54] : Buy [1]
>> OrderQty [38] : 16
>> Price [44] : 9560.5
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 2
>> LastPx [31] : 9560.5
>> ExecID [17] : 37012882
>> MaturityMonthYear [200] : 200803
>> Account [1] : 0000AASP01
>> CumQty [14] : 11
>> LeavesQty [151] : 5
>> OrigSendingTime [122] : 20071030-10:34:50
>> SecurityType [167] : FUT
>> AvgPx [6] : 9560.5000000
>> TransactTime [60] : 20071030-10:34:50
>> PossDupFlag [43] : Y
>>
>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:34:50 (Seq: 2291)
>> ExecType [150] : Partial fill (Replaced) [1]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Partial fill (Replaced) [1]
>> ClOrdID [11] : f8di2160:0.0
>> OrderID [37] : 20071030003319
>> Symbol [55] : GE
>> Side [54] : Buy [1]
>> OrderQty [38] : 16
>> Price [44] : 9560.5
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 2
>> LastPx [31] : 9560.5
>> ExecID [17] : 37012883
>> MaturityMonthYear [200] : 200803
>> Account [1] : 0000AASP01
>> CumQty [14] : 13
>> LeavesQty [151] : 3
>> OrigSendingTime [122] : 20071030-10:34:50
>> SecurityType [167] : FUT
>> AvgPx [6] : 9560.5000000
>> TransactTime [60] : 20071030-10:34:50
>> PossDupFlag [43] : Y
>>
>> Recieved Test Request [1] VWXYZ >>> ASPECT @ 20071030-10:34:50 (Seq: 2292)
>> TestReqID [112] : synchronized?
>>
>> Sent Heartbeat [0] ASPECT >>> VWXYZ @ 20071030-10:34:50.395 (Seq: 1834)
>> TestReqID [112] : synchronized?

The strange behaviour continues around 10 minutes later, when the same thing happens again, but with a twist...

>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:44:26 (Seq: 2350)
>> ExecType [150] : New [0]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : New [0]
>> ClOrdID [11] : f8di216j:6.0
>> OrderID [37] : 20071030001651
>> Symbol [55] : HG
>> Side [54] : Buy [1]
>> OrderQty [38] : 1
>> Price [44] : 34710
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 0
>> LastPx [31] : 0
>> ExecID [17] : 37012937
>> MaturityMonthYear [200] : 200712
>> Account [1] : 0000AASP01
>> CumQty [14] : 0
>> LeavesQty [151] : 1
>> SecurityType [167] : FUT
>> AvgPx [6] : 0
>> TransactTime [60] : 20071030-10:44:26
>>
>> Recieved Execution Report [8] VWXYZ >>> ASPECT @ 20071030-10:44:26 (Seq: 2351)
>> ExecType [150] : Fill (Replaced) [2]
>> ExecTransType [20] : New [0]
>> OrdStatus [39] : Fill (Replaced) [2]
>> ClOrdID [11] : f8di216j:6.0
>> OrderID [37] : 20071030001651
>> Symbol [55] : HG
>> Side [54] : Buy [1]
>> OrderQty [38] : 1
>> Price [44] : 34710
>> OrdType [40] : Limit [2]
>> TimeInForce [59] : Day [0]
>> LastQty [32] : 1
>> LastPx [31] : 34710
>> ExecID [17] : 37012938
>> MaturityMonthYear [200] : 200712
>> Account [1] : 0000AASP01
>> CumQty [14] : 1
>> LeavesQty [151] : 0
>> SecurityType [167] : FUT
>> AvgPx [6] : 34710.0000000
>> TransactTime [60] : 20071030-10:44:26
>>
>> [INFO] [2007-10-30 10:44:26,793] [QFJ Message Processor] [impl.QuickfixPhysicalConnectionHolder] Incoming message received
>> [INFO] [2007-10-30 10:44:26,793] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: MsgSeqNum too high, expecting 2350 but received 2351
>> [INFO] [2007-10-30 10:44:26,793] [QFJ Message Processor] [workunit.SimpleWorkUnitScheduler] Added work unit 'InboundRequestWorkUnit'. Txn: 'system-txn-1193693129718'. Executor queue size: 1
>> [INFO] [2007-10-30 10:44:26,793] [Executor] [workunit.SimpleWorkUnitScheduler] Starting work unit: 'InboundRequestWorkUnit'. Txn: 'system-txn-1193693129718'. Executor queue size: 0
>> [INFO] [2007-10-30 10:44:26,793] [QFJ Message Processor] [quickfixj.event] FIX.4.2:ASPECT->VWXYZ: Processing QUEUED message: 2351
>> Sent Resend Request [2] ASPECT >>> VWXYZ @ 20071030-10:44:26.793 (Seq: 1872)
>> BeginSeqNo [7] : 2351
>> EndSeqNo [16] : 0

This time the resend request itself is odd: despite supposedly having missed 2350, the request is from 2351.

The same thing continues to happen every few minutes until we restart the process. The next time this occurs, it happens twice within a very short period, and our counterparty logs the connection off. Even after the reconnection it continues not to work properly.

The above log extract is taken directly from our application. We have removed some log entries to minimise the logging shown. To give some detail on this process: inbound messages are queued on an internal executor queue for processing (you can see the details above). This is part of an adapter process which translates FIX messages into internal application messages that are posted to another thread for processing.
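For readers unfamiliar with this kind of hand-off, it can be sketched roughly as follows. This is a simplified illustration rather than the actual adapter code: the class `AdapterHandOffSketch` and its methods are hypothetical, and a single-threaded `ExecutorService` is assumed for the internal queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Illustrative only: hypothetical names, not the real adapter classes. */
class AdapterHandOffSketch {
    // Assumption: one worker thread, so work units run in the order they were queued.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final List<String> processed = new CopyOnWriteArrayList<>();

    /** Called on the FIX engine's thread for each message it hands to the application. */
    void onFixMessage(int msgSeqNum, String execId) {
        // Translate to an internal message, then queue it for the worker thread.
        String internal = "Execution EID: " + execId;
        executor.execute(() -> processed.add(internal));
    }

    /** Stops accepting work, waits for the queue to drain, and returns what was processed. */
    List<String> shutdownAndGetProcessed() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(processed);
    }

    public static void main(String[] args) {
        AdapterHandOffSketch adapter = new AdapterHandOffSketch();
        adapter.onFixMessage(2289, "37012881");
        adapter.onFixMessage(2290, "37012882");
        adapter.shutdownAndGetProcessed().forEach(System.out::println);
    }
}
```

Note that a queue like this sits downstream of the engine and performs no de-duplication of its own in the sketch, so whatever the engine delivers twice is processed twice.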