[QFJ-569] QuickFix/J is not scalable due to overly long duration lock on sequence number in Session.sendRaw Created: 10/Dec/10  Updated: 10/Dec/10

Status: Open
Project: QuickFIX/J
Component/s: Engine
Affects Version/s: 1.5.0
Fix Version/s: None

Type: Improvement
Priority: Major
Reporter: Leon Chadwick
Assignee: Unassigned
Resolution: Unresolved
Votes: 1
Labels: None


 Description   

I have been writing a capacity test to measure the performance of our FIX engine (not QuickFIX-based). However, I can barely get my machine to use any CPU during the test, despite using a large number of threads to generate orders and respond with acks.
Sadly, I will now have to drop QuickFIX/J, as this is a showstopper for a high-performance system.

Profiling showed that all of these threads are effectively serialized through quickfix.Session.sendRaw(), which holds a lock for as long as it owns a sequence number. A scalable solution should keep the lock duration to a minimum: do as much message checking and string conversion as possible outside the lock, then fill in the sequence number as a short final step inside it.

A StringBuilder could hold the toString()'d message while the lock is not held, with enough spare capacity left at the head of the array that header content can be shifted left or right to accommodate the sequence number field inside the locked region.
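
The proposal above could be sketched roughly as follows. This is a minimal illustration, not QuickFIX/J code: NarrowLockSender, send and byteSum are hypothetical names, the FIX.4.2 BeginString is an arbitrary choice, and the message is assumed to be plain ASCII. It uses string concatenation rather than the in-place StringBuilder shifting described above, but the idea is the same: the body and its partial BodyLength/CheckSum totals are computed without the lock, and only the MsgSeqNum digits are added inside it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposal; none of these names are QuickFIX/J APIs.
final class NarrowLockSender {
    private static final char SOH = '\u0001';
    private static final String BEGIN = "8=FIX.4.2" + SOH; // assumed version

    private final ReentrantLock seqLock = new ReentrantLock();
    private int nextSeqNum = 1;

    /** Formats one message; only the sequence-number splice runs under the lock. */
    String send(String msgType, String fields) {
        // Unlocked: build everything around the MsgSeqNum (34) value and
        // pre-sum its bytes for BodyLength (9) and CheckSum (10).
        String head = "35=" + msgType + SOH + "34=";
        String tail = SOH + fields; // fields must end with SOH
        int baseLen = head.length() + tail.length();
        int baseSum = byteSum(head) + byteSum(tail);

        seqLock.lock();
        try {
            // Locked: only work that depends on the sequence number.
            String seq = Integer.toString(nextSeqNum);
            String prefix = BEGIN + "9=" + (baseLen + seq.length()) + SOH;
            int checksum = (byteSum(prefix) + baseSum + byteSum(seq)) % 256;
            String wire = prefix + head + seq + tail
                    + "10=" + String.format("%03d", checksum) + SOH;
            // A real engine would store and transmit `wire` here,
            // before committing the increment.
            nextSeqNum++;
            return wire;
        } finally {
            seqLock.unlock();
        }
    }

    /** Sum of character values; equals the byte sum for ASCII input. */
    private static int byteSum(String ascii) {
        int sum = 0;
        for (int i = 0; i < ascii.length(); i++) {
            sum += ascii.charAt(i);
        }
        return sum;
    }
}
```

The final concatenation still copies the message inside the lock, so a real implementation would probably want the preallocated buffer described above; the sketch only shows which computations can move out of the critical section.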



 Comments   
Comment by Steve Bate [ 10/Dec/10 ]

Are you trying to send all these orders through one session? If so, the incoming messages must be processed sequentially by the session protocol engine. However, that doesn't mean your application interface implementation must be single threaded. The application implementation can hand off the message processing to a worker thread pool, for example. The downside is that the FIX engine won't be able to roll back the sequence numbers if an exception occurs in the application code.
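
As a rough illustration of the hand-off described above, the sketch below queues each message on a fixed thread pool and returns immediately. HandOffHandler and onMessage are hypothetical names standing in for an application callback such as fromApp; messages are modeled as plain strings so the example runs without QuickFIX/J on the classpath.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch; HandOffHandler and onMessage are illustrative names,
// not QuickFIX/J types.
final class HandOffHandler {
    private final ExecutorService workers;
    private final Consumer<String> processor;

    HandOffHandler(int threads, Consumer<String> processor) {
        this.workers = Executors.newFixedThreadPool(threads);
        this.processor = processor;
    }

    /** Called on the session thread; queues the work and returns at once. */
    void onMessage(String message) {
        workers.execute(() -> {
            try {
                processor.accept(message);
            } catch (RuntimeException e) {
                // The engine has already moved on, so it cannot roll back
                // the sequence number; the application must handle this itself.
                System.err.println("processing failed: " + e);
            }
        });
    }

    /** Stops accepting work and waits up to the timeout for queued tasks. */
    boolean shutdownAndWait(long timeoutSeconds) {
        workers.shutdown();
        try {
            return workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With more than one worker thread, messages may finish processing in a different order from the one in which they arrived, so any ordering the application needs has to be enforced by the application.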

As the comments state in sendRaw(), the lock is held until the message is processed and the application callback has completed. This is necessary because any exception thrown during this time will cause the sequence number to be rolled back. In other words, the sequence number can't be incremented until these operations complete successfully.
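
The constraint described above can be illustrated with a simplified sketch. It is not the actual sendRaw() implementation: CommitOnSuccessSender and its methods are hypothetical, and the application callback plus transport are collapsed into a single Consumer. The point is only that the increment sits after the step that may throw, so a failure leaves the sequence number unchanged, and the lock has to cover both.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch; not the QuickFIX/J Session implementation.
final class CommitOnSuccessSender {
    private final ReentrantLock lock = new ReentrantLock();
    private int nextSeqNum = 1;

    /** Returns the sequence number used, or propagates the callback's exception. */
    int send(String payload, Consumer<String> callbackAndTransport) {
        lock.lock();
        try {
            int seq = nextSeqNum;
            // May throw; while it runs, no other thread can claim `seq`.
            callbackAndTransport.accept("34=" + seq + "|" + payload);
            nextSeqNum = seq + 1; // reached only if nothing above threw
            return seq;
        } finally {
            lock.unlock();
        }
    }

    /** Next sequence number that will be assigned. */
    int peekNextSeqNum() {
        lock.lock();
        try {
            return nextSeqNum;
        } finally {
            lock.unlock();
        }
    }
}
```

If the lock were released before the callback finished, a second thread could take the next number while the first was still able to fail, leaving a gap that could no longer be undone simply by skipping the increment.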

The message throughput depends on many factors. Some of the biggest factors include the number of sessions, the implementations of the message store and message log and the response time of the application implementation. For example, any logging to the disk in the application callback could have a significant impact on message throughput and CPU utilization.

I don't know if any of this applies to your situation since you didn't include many details. However, I know of other organizations that have reported throughput in the tens of thousands of messages per second.

Generated at Sun May 05 22:14:19 UTC 2024 using JIRA 7.5.2#75007-sha1:9f5725bb824792b3230a5d8716f0c13e296a3cae.