Wednesday, 25 April 2012

Windows Workflow Foundation 4 State Machine and tight loops

I am a big fan of Windows Workflow. Sure, there are a lot of barriers to getting started and a lot of idiosyncrasies, but once you get past those, it can do an awful lot to help with building large, complex systems.

One issue I have come across recently is that workflows using the State Machine can easily go into a tight loop that uses up all the CPU on your web server if you are hosting them in IIS. You may see the worker process (w3wp.exe) sit at 100% CPU usage for a period of time, with nothing in your logs to tell you what is going on.

It turns out that if you have an empty trigger on a transition but the condition blocks it, WWF may go into a tight loop.

A simple example

A simple scenario for this might be that you are designing the workflow and you put in a transition “to be done later” and you just set the condition to False to stop it firing.


In the above simple example, the transition from T1 to T2 has no trigger but the Condition is set to False. You’d think that this would just make the transition never fire, but what actually happens is that the workflow goes into a tight loop where it is continually testing the condition, thus eating up all the CPU on your web server and with nothing in the log to show for it. Note that it is possible that if you run this very simple example it may not actually cause the tight loop – it can sometimes depend on other factors.

A more realistic example

In practice, the above example is very simplistic; a more realistic one looks something like this:


In the above example, the idea is that you have a workflow that needs to wait for a number of documents to arrive in the system and, when they have all arrived, do something. In the example, we have a “waiting for docs” state. Every time that state is entered, it checks how many documents are still pending and writes that to a variable. The “Doc Received” transition in this example is triggered by an external event, for example a WCF ReceiveAndSendReply. The trigger-less transition, in turn, checks the variable and moves on once it gets to zero. So, the idea is that each time a document is received, the workflow is prodded and recalculates the number of pending docs, and once the count reaches zero, it moves on. Alas, while the count is above zero, the trigger-less transition is blocked by its condition, and that causes a tight loop that consumes all your CPU.

The above example can be implemented like this to avoid the problem:


In essence, the arrival of a new document moves the workflow to a new state called Interim. That state has two transitions coming out of it, both with empty triggers, but their conditions are complementary, so exactly one of them is true at any time. Both are evaluated, and the workflow then moves either back to “Waiting for docs” or on to “All docs received”. It never sits in a state with a blocked trigger-less transition, so there is no tight loop.
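To make the difference concrete, here is a toy Python model of the two designs. It is not WF code and makes no attempt to reproduce the real scheduler; it just assumes, as described in this post, that a state keeps re-testing a blocked trigger-less transition for as long as it sits there. The function names and the `idle_checks_per_wait` parameter (a stand-in for "however many times the condition gets tested before the next document shows up") are made up for illustration.

```python
def polling_design(total_docs, idle_checks_per_wait):
    """Condition evaluations when 'Waiting for docs' owns the trigger-less transition."""
    if total_docs < 1:
        raise ValueError("total_docs must be at least 1")
    pending = total_docs
    evaluations = 0
    while pending > 0:
        # The condition (pending == 0) is false, so it is re-tested over and
        # over while we wait. The real loop is bounded only by how long the
        # document takes to arrive; the parameter is an assumed stand-in.
        evaluations += idle_checks_per_wait
        pending -= 1  # "Doc Received" fires and the state is re-entered
    return evaluations + 1  # the final test succeeds and the workflow moves on


def interim_design(total_docs):
    """Condition evaluations when every arrival is routed through 'Interim'."""
    if total_docs < 1:
        raise ValueError("total_docs must be at least 1")
    pending = total_docs
    evaluations = 0
    state = "Waiting for docs"
    while state != "All docs received":
        if state == "Waiting for docs":
            # Only a triggered transition leaves this state, so nothing is
            # tested while it is idle. A document arrives:
            pending -= 1
            state = "Interim"
        else:
            # Interim: two trigger-less transitions, exactly one is true.
            evaluations += 2
            state = "All docs received" if pending == 0 else "Waiting for docs"
    return evaluations


print(polling_design(3, idle_checks_per_wait=1000))  # 3001
print(interim_design(3))                             # 6
```

In this model the cost of the first design scales with how long the workflow waits, while the cost of the Interim design scales only with the number of documents.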

The root cause

I hesitate to call this a bug. In essence, as long as a state machine is sitting in a state, it will continually test all trigger-less transitions. I suspect this is by design, though I don’t like it; I would rather that the trigger-less transitions were only tested once, when the state had finished its “entry”. If I did want a state to continually re-check its conditions, it would be trivial for me to add another transition that just uses a delay to loop back to the same state, and I would then be in control of how “tight” I allowed the loop to be.
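The delay-based alternative is easy to picture outside WF as well. The sketch below is plain Python, not workflow code: `wait_until` and its parameters are hypothetical names, and the injectable `sleep` is only there so the behaviour can be checked without actually waiting. It shows the point of the design: the condition is re-tested at an interval I choose, instead of as fast as the CPU allows.

```python
import time


def wait_until(condition, recheck_delay_seconds, max_checks, sleep=time.sleep):
    """Re-test `condition` at a fixed interval, like a state that loops back
    to itself through a delay-triggered transition.

    Returns the number of evaluations it took, or None if `max_checks`
    evaluations all came back false.
    """
    for check in range(1, max_checks + 1):
        if condition():
            return check
        # The delay is what controls how "tight" the loop is.
        sleep(recheck_delay_seconds)
    return None
```

With a 5 second delay, a condition that stays false for an hour costs roughly 720 evaluations rather than a core running flat out for that hour.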

How to identify the problem

The annoying thing with this problem is that you are unlikely to notice it when you are developing on your nice multi-core development machine: the workflow will just tie up one core while still behaving correctly from a functional point of view. So, it isn’t until you deploy to your web server and have many of these workflows running (or use single-core server instances) that everything falls apart.

If you are experiencing this 100% CPU usage, try enabling tracing in your workflow by putting this in the system.diagnostics section of your web.config (the source element belongs inside sources; the trace element sits directly under system.diagnostics):

    <source name="System.Activities" switchValue="Information">
        <listeners>
            <add name="textListener" type="System.Diagnostics.TextWriterTraceListener" initializeData="WorkflowTraceLog.txt" traceOutputOptions="ProcessId, DateTime" />
            <remove name="Default" />
        </listeners>
    </source>
    <trace autoflush="true" indentsize="4" />

Top tip: If you are working with state machines in Windows Workflow Foundation 4, I suggest you enable this tracing on your development box, keep an eye on the size of the log file and delete it occasionally. If it suddenly gets very big, chances are you have a tight loop somewhere.
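If you would rather not eyeball the file, a few lines of script can do the watching. This is a minimal sketch in Python, not part of WF: the helper names are invented, and the 5 MB-per-interval threshold is an arbitrary assumption that you would tune to what your own workflows normally write.

```python
import os
import time


def log_growth(path, previous_size):
    """Return (current_size, bytes_added) for the trace log at `path`.

    A missing file counts as size 0, and a file that shrank (for example
    because it was deleted and recreated) reports 0 bytes added.
    """
    current = os.path.getsize(path) if os.path.exists(path) else 0
    return current, max(0, current - previous_size)


def looks_like_tight_loop(bytes_added, threshold_bytes=5 * 1024 * 1024):
    """Assumed heuristic: this much new trace output in one interval is suspicious."""
    return bytes_added >= threshold_bytes


def watch(path, interval_seconds=60, rounds=10, sleep=time.sleep):
    """Check the log `rounds` times and print a warning whenever it jumps."""
    size, _ = log_growth(path, 0)
    for _ in range(rounds):
        sleep(interval_seconds)
        size, added = log_growth(path, size)
        if looks_like_tight_loop(added):
            print(f"{path} grew by {added} bytes in {interval_seconds}s: possible tight loop")
```

Calling `watch("WorkflowTraceLog.txt")` from a console while you exercise the workflow is enough; point it at wherever your listener actually writes the file.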