[QFJ-645] Deadlock between message sending and reset due to message reception on Session and senderMsgSeqNumLock Created: 27/Oct/11  Updated: 08/Apr/14  Resolved: 31/Jul/12

Status: Resolved
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.5.0, 1.5.1
Fix Version/s: 1.5.3

Type: Bug Priority: Critical
Reporter: marc sperisen Assignee: Christoph John
Resolution: Fixed Votes: 2
Labels: deadlock

Attachments: Java Source File SessionResetTest.java    
Issue Links:
Duplicate
is duplicated by QFJ-700 QFJ Timer and QFJ message processor t... Closed
is duplicated by QFJ-571 QFJ Timer - JDBCStore reset issues Closed

 Description   
Deadlock arises between the following threads:
 
Thread1:
"GatewayStateTimer" daemon prio=3 tid=0x04d65388 nid=0x2bef waiting for monitor entry [0x790ff000..0x790ff9f0]
        at quickfix.SessionState.isLogonSent(SessionState.java:162)
        - waiting to lock <0x9d12c020> (a quickfix.Session)
        at quickfix.Session.sentLogon(Session.java:690)
        at quickfix.Session.isLoggedOn(Session.java:731)
        at quickfix.Session.sendRaw(Session.java:2151)
        at quickfix.Session.send(Session.java:2213)
        at quickfix.Session.sendToTarget(Session.java:590)
        at com.greatBank.application.fix.Engine.sendMessage(Engine.java:54)
        at com.greatBank.application.fix.Engine.pingFIX(Engine.java:191)
        at com.greatBank.application.fix.Engine.access$0(Engine.java:186)
        at com.greatBank.application.fix.Engine$PingTimerTask.run(Engine.java:176)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
 
-> Thread1 is sending a message: it first acquires the lock senderMsgSeqNumLock and then tries to acquire a lock on the session instance
 
 
Thread2:
"QF/J Session dispatcher: FIX.4.2:SENDER1->TARGET1" daemon prio=3 tid=0x014e5ac8 nid=0x2bfc waiting on condition [0x794ff000..0x794ff970]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:118)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:716)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:746)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1076)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:184)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:256)
        at quickfix.SessionState.lockSenderMsgSeqNum(SessionState.java:322)
        at quickfix.Session.sendRaw(Session.java:2117)
        at quickfix.Session.generateLogout(Session.java:1293)
        at quickfix.Session.generateLogout(Session.java:1284)
        at quickfix.Session.reset(Session.java:756)
        - locked <0x9d12c020> (a quickfix.Session)
        at quickfix.Session.next(Session.java:836)
        at quickfix.mina.ThreadPerSessionEventHandlingStrategy$MessageDispatchingThread.run(ThreadPerSessionEventHandlingStrategy.java:119)
 
-> Thread2: while processing an incoming message, the engine notices that it is in a new session time interval (StartTime/EndTime) and therefore a MsqSeqNum reset is started.
The reset method acquires a lock on the session and, when generating a Logout, tries to acquire the lock senderMsgSeqNumLock, and then deadlocks with Thread1

 Comments   
Comment by Peter Clarke [ 12/Jan/12 ]

Hi
I've also seen this issue with our application.

My solution to this was to remove the synchronized keyword from the Session.reset function.

I've also attached a unit test we have to check for the issue.

Pete

Comment by John [ 17/Jan/12 ]

had this one too, but I haven't changed the library so I synchronized on the "Session" (=> so locks are taken in the same order).

Comment by John [ 02/Feb/12 ]

Erratum on the preceding comment.
Synchronizing "send" on the session created a huge performance hit when having heavy load.

Peter's was solution was the best one from the beggining and my workarround is not that good ;o)

I just put the code I used for the "reset" method :

quickfix.Session
public void reset() throws IOException {
  if (!resetting.compareAndSet(false, true)) {
    return;
  }
  try {
    if (hasResponder()) {
      if (application != null && application instanceof ApplicationExtended) {
        ((ApplicationExtended) application).onBeforeSessionReset(sessionID);
      }
      generateLogout();
      disconnect("Session reset", false);
    }
    resetState();
  }
  finally {
    resetting.set(false);
  }
}
Comment by Christoph John [ 31/Jul/12 ]

Peter and John,
thanks for your contributions. I have changed the reset() method accordingly and have added the UnitTest (I changed it slightly to work with a pausable Executor and made it detect the deadlock programmatically).

Generated at Sat Sep 20 11:57:42 CEST 2014 using JIRA 6.2.1#6256-sha1:89f5b3f8f84d40330af33e974b1b83cd9a6871b1.