[QFJ-849] SocketInitiator does not stop properly Created: 10/Jun/15  Updated: 17/Jul/17  Resolved: 22/Mar/16

Status: Closed
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.6.0
Fix Version/s: 1.6.2

Type: Bug Priority: Critical
Reporter: Or Ming Chun Assignee: Christoph John
Resolution: Fixed Votes: 4
Labels: QuickfixJ, session
Environment:

Tested on Mac & Ubuntu


Attachments: Text File jstack-162-SNAPSHOT-20160312.txt     Text File quickfixJ_1.6.1_TD.txt     Text File socket.initiator.stop(false).txt     Text File socket.initiator.stop(true)-original-before-and-after.txt     Text File socket.initiator.stop(true)-original.txt     Text File socket.initiator.stop(true).txt    
Issue Links:
Duplicate
is duplicated by QFJ-907 Dead sockets are not closed when Sock... Closed

 Description   

When creating a simple Quickfixj Initiator app and after the session has successfully been created and Logon, I called initiator.stop() and exit the program. But the JVM hang and cannot exit after ~60s



 Comments   
Comment by Christoph John [ 11/Jun/15 ]

Two things:

  1. what happens if you call stop(true)
  2. could you create a stack dump when it is hanging?
    Thanks
Comment by GARY LO [ 11/Jun/15 ]

Hi Christoph John,

1. Same as stop(false), it will hangs about 60s (with no logging) before exit

2. Both stop(false) & stop(true) thread dump is attach during the hangs period

Comment by Christoph John [ 11/Jun/15 ]

Hmm, how do you exit the program? There is no trace of any QFJ-related code in the thread dump? For reference, could you create a thread dump while the program is running?

Comment by GARY LO [ 11/Jun/15 ]

Sorry that I provided wrong log that used my customized quickfixJ version that try to shun down all the executor service when I can initiator.stop()

Now, fall back to 1.6.0 original version, problem still exist

Comment by Christoph John [ 11/Jun/15 ]

Still can't see nothing QFJ-related in the dump, not even user code. Only Java classes. Could you please provide a dump while the process is running?

Comment by GARY LO [ 11/Jun/15 ]

I am using "jstack -l [pid]" to get the thread dump, is that what you want?

added thread dump of before & after initiator.stop(true)

Comment by Christoph John [ 12/Jun/15 ]

Yes, jstack is correct.
In your last dump I can still see the QFJ Timer thread being active. I don't know why it wasn't in the first dumps, though.

If your program does exit after 60 seconds it seems that it waits for the thread in the ScheduledThreadPoolExecutor to die. But actually the QFJ Timer thread is a daemon thread and should not keep the JVM from shutting down.
How do you exit your program?

Could you try with a newer / older / just another JVM version? I have not encountered this problem yet.

Comment by Or Ming Chun [ 12/Jun/15 ]

we are working or Java 1.8

$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

is this information helpful?

Comment by Christoph John [ 12/Jun/15 ]

How do you exit your program?

Could you try with a newer / older / just another JVM version?

Comment by Or Ming Chun [ 12/Jun/15 ]

does it matter on what jvm version I use to build the quickfixj jar? I am using Java 1.8 too build quickfixj too

Comment by Christoph John [ 12/Jun/15 ]

I don't think that it matters. IMHO it could be dependent on the JVM that is used to run the program. It is only a guess since I never saw this error. But it could be also dependent on other things in your code/environment/etc. But I assume it is a simple program that is run on the command line and not something that is running in Spring, an app server, an OSGi environment or the like??!

Comment by Rohit [ 01/Sep/15 ]

We see the same thing after we upgraded to QuickFix 1.6.1.
The call to initiator.stop hangs. Meanwhile , we see heartbeats are still sent out from the QuickfixJ engine.
I am attaching a thread dump which I hope helps.
The java version -
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

Comment by Steve Nardone [ 03/Sep/15 ]

I was able to resolve the problem by adding this statement to IoSessionInitiator.stop():

synchronized void stop() {
if (reconnectFuture != null)

{ reconnectFuture.cancel(true); reconnectFuture = null; }

// Shutdown executor created by ioConnector (indirectly).
reconnectTask.ioConnector.dispose();
// ^^^
}

This disposes the executor created by the default constructor of NioSocketConnector (mina) instantiated in the IoConnector constructor. That executor (ExecutorService.newCachedThreadPool) has a 60s keepAliveTime. During shutdown that thread is blocked waiting until the next interval to check for more tasks.

Comment by Martin Vrábel [ 09/Sep/15 ]

Also spotted this issue both in version 1.6.1 and 1.7.0-SNAPSHOT

Comment by Rohit [ 25/Sep/15 ]

I was able to workaround this by avoiding the call to SocketInitiator.Stop()
I am logging out all the sessions manually and setting the initiator to null so it can be garbage collected

This extends to SocketAcceptors as well


 		if (socketInitiator != null) {
-			socketInitiator.stop();
+			Iterator<SessionID> sessionIds = socketInitiator.getSessions().iterator();
+			while (sessionIds.hasNext()) {
+				SessionID sessionId = sessionIds.next();
+				Session.lookupSession(sessionId).logout("user requested");
+			}
+			while (socketInitiator.isLoggedOn()) {
+				try {
+					Thread.sleep(500);
+				} catch (InterruptedException e) {
+					//Obfuscated
+				}
+			}
+			acceptor = null;
 		}
-		//Obfuscated
+		//Obfuscated
+			//Obfuscated
+		}
 		//ObfuscatedApplication = null;

Comment by Christoph John [ 25/Sep/15 ]

Hi, what about Steve Nardone 's solution a few comments above? Looks cleaner to me. Although I have not tested it yet.

Comment by Rohit [ 25/Sep/15 ]

I haven't tried that out yet. I am pulling the QuickFixJ 1.6.1 dependency straight out of the maven repository and not building my own .
I will give it a shot and see if it works. Would avoid an ugly workaround.

Comment by Christoph John [ 29/Dec/15 ]

Hi Steve Nardone,
I have integrated your suggested solution into the 1.6.x as well as the 1.7.0 branch.

Rohit, Martin Vrábel: could you test if that works for you?

Thanks,
Chris.

Comment by Colin DuPlantis [ 12/Mar/16 ]

We have encountered this issue, too, in 1.6.1, but only on machines where the "is_production_machine" flag is set to true. We have not been able to reproduce the problem outside of production. The production machine is a 16-core RedHat distro based on 2.6. Our test machines tend to be 3.x based on other distros, and no more than 8 cores, generally 2 or 4. Not sure if this is relevant in the reproduction or not.

We built and deployed the latest 1.6.2-SNAPSHOT to our (Marketcetera) repo, which has this fix included, but this did not solve the problem.

The problem seems to occur when you're trying to shutdown the initiator while it's busy receiving messages. I've uploaded a relevant jstack output that shows the problem in 1.6.2-SNAPSHOT.

Comment by Christoph John [ 14/Mar/16 ]

Hi Colin, thanks for the stack trace. It helps a lot. It seems that this is a similar issue as described in this comment: http://www.quickfixj.org/jira/browse/QFJ-854?focusedCommentId=12737&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12737
It is tried to access the EventHandlingStrategy while it is still locked. I think we can access the strategy outside of the lock and put a END_OF_STREAM message into it. That should resolve the "deadlock".

Edit: I will try to integrate this into 1.6.2-SNAPSHOT in the next days and let you know.

Comment by Colin DuPlantis [ 14/Mar/16 ]

It looks like we were able to work around the problem by manually disconnecting each session in the initiator before stopping it.

Comment by Christoph John [ 16/Mar/16 ]

I have made the promised changes. Could you please check if it resolves your problem?
Thanks, Chris.

Comment by Colin DuPlantis [ 16/Mar/16 ]

The problem wasn't 100% reproducible. With that caveat, we were unable to reproduce the problem at all with the changes.

Comment by Christoph John [ 22/Mar/16 ]

Fair enough. Please feel free to report if problem appears again.

Generated at Sun May 05 15:30:26 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.