QuickFIX/J - QFJ-220

QFJ hangs in IoSessionResponder.disconnect() while waiting for scheduled messages to be written

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Default
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: Engine
    • Labels:
      None
    • Environment:
      HEAD revision of QFJ (actually, rev 710)

      Description

      This is a continuation of QFJ-215.
      I'm now seeing a new hang in IoSessionResponder.disconnect() on the waitForScheduleMessagesToBeWritten() line.

      It seems that one of the connecting sockets gets the underlying MINA framework into a "bad state", and IoSessionResponder.waitForScheduleMessagesToBeWritten() never exits the loop that waits for the messages to be written.

      I have 2 acceptor sessions running with MINA 1.1.0, and both end up in this state after a few days of being up.
      See the attached logs for the full stack dump, and the output of netstat command (on Ubuntu Linux).

      The relevant portion of the stack trace is:
      "SocketAcceptorIoProcessor-0.0" prio=1 tid=0x082427c0 nid=0x546e sleeping[0xb0c71000..0xb0c71f60]
              at java.lang.Thread.sleep(Native Method)
              at quickfix.mina.IoSessionResponder.waitForScheduleMessagesToBeWritten(IoSessionResponder.java:55)
              at quickfix.mina.IoSessionResponder.disconnect(IoSessionResponder.java:43)
              at quickfix.Session.disconnect(Session.java:1370)
              at quickfix.mina.AbstractIoHandler.exceptionCaught(AbstractIoHandler.java:82)


      It seems that the waitForScheduledMessagesToBeWritten() call added in rev 698 (http://quickfixj.svn.sourceforge.net/viewvc/quickfixj/trunk/core/src/main/java/quickfix/mina/IoSessionResponder.java?view=diff&r1=697&r2=698) is the crux.
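The failure mode described above can be sketched as follows. This is an illustrative reconstruction, not the actual QFJ source: the IntSupplier stands in for MINA's scheduled-write counter, and the point is that a loop with no deadline spins forever if a socket error leaves that counter stuck above zero.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Illustrative reconstruction of the hang (NOT the actual QFJ source).
// If a socket error leaves the pending-write count stuck above zero,
// the loop below never exits - the shape of the hang in this report.
final class UnboundedWaitSketch {
    static void waitForScheduledMessagesToBeWritten(IntSupplier pendingWrites) {
        while (pendingWrites.getAsInt() > 0) {
            // No deadline: a counter that never drains means an infinite loop.
            Thread.onSpinWait();
        }
    }

    public static void main(String[] args) {
        AtomicInteger pending = new AtomicInteger(3);
        // Here the counter drains, so the call returns; with a stuck counter
        // (as after the socket error in this report) it would spin forever.
        waitForScheduledMessagesToBeWritten(
                () -> pending.updateAndGet(v -> Math.max(0, v - 1)));
        System.out.println("done, pending=" + pending.get());
    }
}
```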

        Attachments

          Issue Links

            Activity

            Toli Kuznets added a comment -

            Output of 'netstat -an | grep 7001' (our acceptor port) on Linux.
            The main 7001 socket is established, but all the incoming connections are hung.

            Toli Kuznets added a comment -

            Full Java stack trace at time of hang

            Steve Bate added a comment -

            Damn. The wait functionality is there because disconnect would frequently complete before the Logout message was sent. We could wait just once for a short time (say 50ms) and then do the close. That should generally be enough time for the Logout to be sent. If it's not sent, it's not a critical problem, but it's polite behavior to send the message.

            Toli Kuznets added a comment -

            Steve, what could we be doing differently in QFJ with MINA that others aren't?
            From the stack trace, it looks like the problem originates in SocketIoProcessor.read (line 227; I had been using MINA 1.1.1).
            It's sitting in the catch(Throwable) inside the read loop - so could it be that there's an error on the socket, it fires an exception, and then we end up waiting for the socket to send out some messages while the socket-related code has already essentially "panicked"?

            As a result, we can't send any outgoing Logout messages and end up hanging.

            Is this possible? Where's the code path that tries to send something out? Or does it mean we just have some "outstanding" messages waiting to go out when the socket code panics, and that it's not necessarily a Logout we are waiting to send?

            Steve Bate added a comment -

            I think your analysis is probably correct, but without digging deeply into MINA I'm not sure how to verify it. My guess is that a message send has been initiated, there is a socket error, and the pending write count is never decremented (although, strictly speaking, there are no more pending writes).

            I'm not sure we are doing anything differently in checking the written message count. However, we can't call close().join() (the related problem) because we use a manual thread model. MINA is typically used with some type of thread pool executor. QF manages the threads in the Connector implementations and I didn't want another thread handoff between MINA and the Connectors. In the long term, I'd like to eliminate the thread management in the Connectors and use a more typical MINA threading architecture. The only sticky issue is that messages must be delivered to Sessions in the order they were received.

            Toli Kuznets added a comment -

            What if we overload the Session.disconnect() call to take a boolean specifying whether or not to wait for messages to be cleared?

            In that case, when we call Session.disconnect() from AbstractIoHandler.exceptionCaught(), we tell it to skip waiting for messages to be written, and let it do the waiting in all the other cases.

            Then, if our theory is correct, that should fix the problem. What kind of message are we expecting to be sent out? A Logout?

            I can try this modification on the Marketcetera Exchange simulator and ask the users to pound the exchange and see what happens.
            Is there any way to exercise this? I think it may be triggered by having many simultaneous incoming connections on different SessionIDs.
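The overload proposed in this comment could look roughly like the sketch below. The class and method names and the plain int counter are illustrative stand-ins, not the actual QFJ Responder API; the counter drain stands in for messages actually being flushed to the socket.

```java
// Sketch of the proposed disconnect(boolean) overload
// (names are illustrative, not the real QFJ API).
final class ResponderSketch {
    private int scheduledWrites;
    private boolean closed;

    ResponderSketch(int scheduledWrites) {
        this.scheduledWrites = scheduledWrites;
    }

    // Existing behavior: wait for pending messages, then close.
    void disconnect() {
        disconnect(true);
    }

    // Proposed overload: exceptionCaught() would pass false to skip the
    // wait, since the socket is already broken and the pending writes
    // will never complete anyway.
    void disconnect(boolean waitForMessages) {
        if (waitForMessages) {
            while (scheduledWrites > 0) {
                scheduledWrites--; // stand-in for messages being flushed
            }
        }
        close();
    }

    private void close() { closed = true; }

    boolean isClosed() { return closed; }
    int pending() { return scheduledWrites; }

    public static void main(String[] args) {
        ResponderSketch errorPath = new ResponderSketch(2);
        errorPath.disconnect(false); // skip the wait on the exception path
        System.out.println(errorPath.isClosed() + " " + errorPath.pending());
    }
}
```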

            Toli Kuznets added a comment -

            Patch propagating a boolean for whether or not to wait for all messages to be written out.
            This is not the most elegant solution, as it adds another function call to the Responder API that duplicates the existing one.

            If we choose to go with this solution, I'd vote to modify the Responder API to remove the overloaded call and only have one with the boolean.

            I'll update the bug in a few days, after this code has run long enough behind the outward-facing Marketcetera Exchange Simulator (http://exchange.marketcetera.com/marketDisplay) to show there are no other lingering issues with dynamic acceptor sessions.

            Steve Bate added a comment -

            We'll only wait for a limited amount of time (up to 50ms) for messages to be written. This should eliminate the hang.
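A minimal sketch of the bounded wait described here (not the committed code): poll the pending-write count, but give up once the deadline passes and proceed with the close regardless. The IntSupplier again stands in for MINA's scheduled-write counter.

```java
import java.util.function.IntSupplier;

// Sketch of the bounded wait described in the fix (not the committed code):
// poll the pending-write count, but give up after maxMillis and close anyway.
final class BoundedWaitSketch {
    static boolean waitForScheduledMessages(IntSupplier pendingWrites, long maxMillis) {
        long deadline = System.currentTimeMillis() + maxMillis;
        while (pendingWrites.getAsInt() > 0) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // timed out: proceed with the close regardless
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true; // all scheduled messages were written in time
    }

    public static void main(String[] args) {
        // A stuck counter (as in this bug) no longer hangs: we return after ~50ms.
        System.out.println(waitForScheduledMessages(() -> 1, 50)); // false
        System.out.println(waitForScheduledMessages(() -> 0, 50)); // true
    }
}
```

A permanently stuck counter now costs at most one bounded wait per disconnect instead of hanging the SocketAcceptorIoProcessor thread forever.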


              People

              • Assignee:
                Steve Bate
              • Reporter:
                Toli Kuznets
              • Votes: 0
              • Watchers: 0
