[QFJ-291] Problems with initiator reconnect will block heartbeats on other sessions Created: 17/Jan/08  Updated: 07/Jan/20  Resolved: 18/Jun/08

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.2.1, 1.3.0, 1.3.1
Fix Version/s: 1.3.2

Type: Bug Priority: Major
Reporter: Rhys Yarranton Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
is related to QFJ-555 Problems with heartbeating in one ses... Open

 Description   

The session timer and initiator reconnect tasks run on the same single threaded executor (SessionConnector.scheduledExecutorService). As a result, if there is a problem reconnecting on one session, it will block the heartbeat on other sessions, leading to test requests from the counterparties and, if bad enough, disconnects. We have seen this scenario in production twice and view it as serious. Other similar scenarios are possible.

Suggested fix is to switch to an unbounded pool, i.e.,
scheduledExecutorService = Executors.newScheduledThreadPool(1, new QFTimerThreadFactory());
(There is no obvious reason the tasks cannot execute in parallel.)



 Comments   
Comment by Rhys Yarranton [ 17/Jan/08 ]

One other thing to note. In the event log for the problem initiator, all you'll see is:
xx:xx:xx,xxx INFO [xxxx.Event] FIX.4.2:xxxx->yyyy: null
This is due to IoSessionInitiator.connect() line 136, which shows only the message on the IOException and does not include the stack trace. From a diagnostic point of view, I would suggest logging all exceptions caught in that catch block, and also not unrolling them (lines 132-134). (All line numbers from 1.2.1.)

Comment by Steve Bate [ 17/Jan/08 ]

OK, I understand the issue. I'm not sure I want to use an unbounded thread pool since it may lead to excessive thread construction and destruction, but I'll consider that as a possible solution. I'm thinking that I'll probably modify the ConnectTask to not wait indefinitely for the connection to complete. Instead it will basically periodically poll the ConnectFuture until the connection request completes.

On the exception issue, can you explain more how the stack trace for the IOExceptions would help you? I can understand how adding a prefix to the exception message might help when the IOException has a null message. However, printing the stack traces results in a large amounts of useless stack traces in the log files during normal reconnect scenarios.

Comment by Rhys Yarranton [ 22/Jan/08 ]

That sounds like it should work for the initiator task. The heartbeat task normally shouldn't block for long as the send is asynchronous. otoh, it could include application toAdmin() code over which the engine has no control. There is the option of a bounded pool with bound > 1, which would not fully isolate each session but would allow for a handful of problem sessions.

I'm not sure how to explain on the stack trace thing. It would include the exception type and message for all the messages in the chain, so you would likely get more info than just "null". It would make it easier to find where in the code it came from (I had to go code surfing to find log statements that might output "null"). And often we find useful info by seeing where something died in a third party library as it might point us to a botched config, a known bug or whatever. In this case hopefully it would allow us to say "This is a socket timeout," as opposed to "This might be a socket timeout, or it could be a pause followed by a disconnect or ..."

Personally I would prefer to see stack traces unless the cause is definitely identified. e.g., if you can see the exception is a "connection reset by peer" and want to be brief, that makes sense, but if it's mystery error number five I want to see all the gory details. QF is extremely terse in this regard. e.g., if there is a checksum error you get a warning, but it doesn't show you the problem section of stream. (Have a patch for that, will enter a JIRA.)

Comment by Christoph John [ 07/Jan/20 ]

This will be improved with https://github.com/quickfix-j/quickfixj/issues/254.

Generated at Sun May 05 11:27:05 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.