QuickFIX/J

QFJ-220: QFJ hangs in IoSessionResponder.disconnect() while waiting for scheduled messages to be written

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Default
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: Engine
    • Labels:
      None
    • Environment:
      HEAD revision of QFJ (actually, rev 710)

      Description

      This is a continuation of QFJ-215.
      I'm now seeing a new hang that happens in IoSessionResponder.disconnect() on the waitForScheduleMessagesToBeWritten() line.

      It seems that one of the connecting sockets gets the underlying MINA framework into a "bad state", and IoSessionResponder.waitForScheduleMessagesToBeWritten() never gets out of the loop that waits for the messages to be written.

      I have 2 acceptor sessions running with MINA 1.1.0, and both end up in this state after a few days of being up.
      See the attached logs for the full stack dump and the output of the netstat command (on Ubuntu Linux).

      The relevant portion of the stack trace is:
      "SocketAcceptorIoProcessor-0.0" prio=1 tid=0x082427c0 nid=0x546e sleeping[0xb0c71000..0xb0c71f60]
              at java.lang.Thread.sleep(Native Method)
              at quickfix.mina.IoSessionResponder.waitForScheduleMessagesToBeWritten(IoSessionResponder.java:55)
              at quickfix.mina.IoSessionResponder.disconnect(IoSessionResponder.java:43)
              at quickfix.Session.disconnect(Session.java:1370)
              at quickfix.mina.AbstractIoHandler.exceptionCaught(AbstractIoHandler.java:82)


      It seems that the waitForScheduledMessagesToBeWritten() call that was added in rev 698 (http://quickfixj.svn.sourceforge.net/viewvc/quickfixj/trunk/core/src/main/java/quickfix/mina/IoSessionResponder.java?view=diff&r1=697&r2=698) is the crux.
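
      For reference, the rev 698 loop presumably looks something like the sketch below, reconstructed from the stack trace and the MINA 1.x API (IoSession.getScheduledWriteRequests() is the pending-write counter); it is not a verbatim copy of the QFJ source. The point is that the loop has no upper bound, so a counter that never drops to zero keeps disconnect() spinning forever:

          // Reconstruction from the stack trace and the MINA 1.x API;
          // not a verbatim copy of the QFJ source.
          private void waitForScheduleMessagesToBeWritten() {
              // With no upper bound, a pending-write count that never reaches
              // zero (e.g. after a socket error) keeps this loop spinning.
              while (ioSession.getScheduledWriteRequests() > 0) {
                  try {
                      Thread.sleep(10); // the Thread.sleep frame in the trace
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                      return;
                  }
              }
          }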
      Attachments

      1. responder.patch (4 kB, Toli Kuznets)
      2. socket_state.txt (14 kB, Toli Kuznets)
      3. trace.txt (9 kB, Toli Kuznets)


          Activity

          Toli Kuznets added a comment -

          Output of 'netstat -an | grep 7001' (our acceptor port) on Linux.
          The main 7001 socket is established, but all the incoming connections are hung.

          Toli Kuznets added a comment -

          Full Java stack trace at the time of the hang.

          Steve Bate added a comment -

          Damn. The reason the wait functionality is there is that disconnect would frequently complete before the logout message was sent. We could just wait once for a short time (say 50 ms) and then do the close. That should generally be enough time for the logout to be sent. If it's not sent, it's not a critical problem, but it's polite behavior to send the message.

          Toli Kuznets added a comment -

          Steve, what could we be doing differently in QFJ that other MINA users aren't?
          From the stack trace, it looks like the problem originates in SocketIoProcessor.read (line 227; I had been using MINA 1.1.1).
          It's sitting in the catch (Throwable) inside the read loop. So could it be that there's an error on the socket, it fires an exception, and then we end up waiting for the socket to send out some messages while the socket-related code has already essentially "panicked"?

          As a result, we can't send any outgoing Logout messages and end up hanging.

          Is this possible? Where's the code path that tries to send something out? Or does it mean we just have some "outstanding" messages waiting to go out when the socket code panics, and that it's not necessarily a Logout we are waiting to send?

          Steve Bate added a comment -

          I think your analysis is probably correct, but without digging into MINA deeply I'm not sure how to verify it. My guess is that a message send has been initiated, there is a socket error, and the pending write count is never decremented (although strictly speaking, there are no more pending writes).

          I'm not sure we are doing anything differently in checking the written message count. However, we can't call close().join() (the related problem) because we use a manual thread model. MINA is typically used with some type of thread pool executor. QF manages the threads in the Connector implementations and I didn't want another thread handoff between MINA and the Connectors. In the long term, I'd like to eliminate the thread management in the Connectors and use a more typical MINA threading architecture. The only sticky issue is that messages must be delivered to Sessions in the order they were received.

          Toli Kuznets added a comment -

          What if we overload the Session.disconnect() call to take a boolean specifying whether or not to try to wait for messages to be cleared (sketched below)?

          In that case, when we call Session.disconnect() from AbstractIoHandler.exceptionCaught(), we tell it to skip waiting for messages to be written, and let it do the waiting in all the other cases.

          Then, if our theory is correct, that should probably fix the problem. What kind of message are we expecting to be sent out? A Logout?

          I can try this modification on the Marketcetera Exchange simulator and ask the users to pound the exchange and see what happens.
          Is there any way to exercise this? I think it may be triggered by having many simultaneous incoming connections on different SessionIDs.
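
          A minimal sketch of the proposed overload, assuming the existing Session/Responder names; the boolean parameter and its plumbing through the Responder API are the suggestion here, not code that exists yet:

              // Hypothetical sketch of the proposed overload; not committed
              // QFJ code.
              public void disconnect() throws IOException {
                  disconnect(true); // normal path: wait for pending messages
              }

              public void disconnect(boolean waitForScheduledMessages) throws IOException {
                  if (responder != null) {
                      // AbstractIoHandler.exceptionCaught() would pass false:
                      // after a socket error there is no point waiting for
                      // writes that can never complete.
                      responder.disconnect(waitForScheduledMessages);
                  }
                  // ... the rest of the existing disconnect logic (reset logon
                  // state, notify the application, etc.) would follow unchanged
              }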

          Toli Kuznets added a comment -

          Patch propagating a boolean for whether or not to wait for all messages to be written out.
          This is not the most elegant solution, as it adds another function call to the Responder API that duplicates the existing one.

          If we choose to go with this solution, I'd vote to modify the Responder API to remove the overloaded call and keep only the one that takes the boolean.

          I'll update the bug in a few days, once this code has run long enough behind the outward-facing Marketcetera Exchange Simulator (http://exchange.marketcetera.com/marketDisplay) to show there are no other lingering issues with dynamic acceptor sessions.

          Steve Bate added a comment -

          We'll only wait for a limited amount of time (up to 50 ms) for messages to be written. This should eliminate the hang.
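
          In sketch form, assuming the same MINA 1.x pending-write counter as in the reconstruction above, the bounded wait would look roughly like this (an illustration, not the verbatim fix):

              // Illustrative bounded wait: give pending writes up to ~50 ms to
              // drain, then close regardless, so a wedged socket can no longer
              // hang disconnect().
              private void waitForScheduledMessagesToBeWritten() {
                  long deadline = System.currentTimeMillis() + 50;
                  while (ioSession.getScheduledWriteRequests() > 0
                          && System.currentTimeMillis() < deadline) {
                      try {
                          Thread.sleep(1);
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                          break; // close anyway; losing the logout is acceptable
                      }
                  }
              }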


            People

            • Assignee: Steve Bate
            • Reporter: Toli Kuznets
            • Votes: 0
            • Watchers: 0
