Monday, 6 May 2013

Real-world Corporate Browser Stats

Paul Irish recently said on Twitter “Very surprised how many developers actively support IE7 despite it's miniscule usage now. Let's move on!”, referencing StatCounter figures showing IE7 at 0.64% usage. A bit of a conversation followed about corporate users, and the question was raised about what the stats actually look like for them. I have a few systems that are used internally by large corporates in the UK, and here is some data on them to hopefully shed some light.

The two systems I have looked at are used internally by corporate users only – there is no public access and little reason for employees to access them from home. “System A” is used by maybe 10,000 people in a single large UK organisation with more than 50,000 total employees. “System B” is used by about 25 UK corporates averaging about 3,000 – 5,000 employees each. Do note that the numbers are relative, and System B added new customers over the period, so some data that looks like the reversal of a trend is probably just the addition of more customers who use older browsers.

Browser stats

I have included tabulated data at the bottom of the post.

[Charts: browser share by month for Systems A and B]

The key finding here is that, although declining, IE7 is still holding at around 10%, and even in April 2013 IE8 accounts for 72% of traffic in System A and 39% in System B. The most surprising finding in many ways is the total dominance of IE. We do see traffic from a wide variety of browsers (including Netscape!), but IE is all-dominant in this limited sample set.

Operating Systems

I have included corresponding information about the operating systems being used to see how much of a correlation there is between IE8 usage and Windows XP.

[Charts: operating system share by month for Systems A and B]

The most surprising finding here is that despite 77% of System B’s users being on Windows 7, 39% are still using IE8. I haven’t cross-correlated OS with browser directly, but I doubt it would reveal much more.

Speculation

The data presented here is obviously not representative of all companies, but it does reflect the reality of nearly 30 large UK companies that have no other characteristics in common. Having worked with these companies over the years, I can say they do not take browser upgrades lightly; the browser is part of a standard desktop setup.

I have, however, also found that many of these companies have rules that allow the use of, say, Firefox if the employee needs to use a system that just won’t work in IE7. But it is not installed by default on employees’ workstations.

These days I don’t get much push-back when telling people we won’t support IE7 in new systems – they do understand it is just too old. However, my real pain-point is IE8; it still has some horrible oddities and it supports neither media queries nor ECMAScript 5 natively. My standard approach when quoting for systems now is that we can build responsive, mobile-friendly apps if we go for “last 2 versions of all major browsers”, meaning IE9 and IE10. You can include limited functionality on IE8 without too much trouble, but if you want mobile-friendly and IE8 support in the same system then the cost of the project goes up very considerably. Ultimately, customers do get the business case; they can certainly have all the super modern stuff and also have it work perfectly on IE8 and even IE7 – but there is a significant monetary cost. It is not about whether these things are technically possible, it is about whether there is a business case for them. We as developers need to remember that these decisions should be business decisions, not technical decisions, and should be evaluated over the likely lifetime of the UI you are building.

I have no basis in the data for it, but given that the highest version of IE you could run on Windows XP was IE8, hopefully the forthcoming de-support of Windows XP will push people up to IE9 or IE10.

Data notes

The data was obtained from IIS logs, and the browser and OS were deduced from the user agent string using http://user-agent-string.info/. Entries with an “unknown” user agent or operating system were removed from the data set and are not included in “other”.

Tabular data

System A Browser Stats
  IE 6.0 IE 7.0 IE 8.0 IE 9.0 IE 10.0 Chrome Firefox Others
2012-08 0.3% 64.6% 33.9% 0.0% 0.0% 0.6% 0.4% 0.0%
2012-09 1.1% 60.9% 35.2% 0.1% 0.0% 0.3% 2.2% 0.2%
2012-10 1.0% 57.4% 38.3% 0.1% 0.0% 0.5% 2.5% 0.1%
2012-11 1.1% 61.6% 33.6% 0.1% 0.0% 0.4% 3.2% 0.1%
2012-12 1.3% 55.6% 36.9% 0.2% 0.0% 0.5% 5.3% 0.1%
2013-01 1.2% 52.1% 37.4% 0.2% 0.0% 0.8% 8.1% 0.1%
2013-02 0.8% 37.8% 46.7% 0.7% 0.0% 0.6% 13.2% 0.1%
2013-03 0.7% 21.4% 61.0% 11.7% 0.5% 0.6% 3.9% 0.2%
2013-04 0.7% 13.1% 72.1% 9.9% 0.1% 0.3% 3.5% 0.2%
System B Browser Stats
  IE 6.0 IE 7.0 IE 8.0 IE 9.0 IE 10.0 Chrome Firefox Others
2012-08 0% 18% 57% 16% 0% 0% 4% 5%
2012-09 1% 13% 45% 37% 0% 1% 2% 1%
2012-10 1% 11% 36% 47% 0% 2% 3% 0%
2012-11 0% 6% 22% 68% 0% 2% 2% 0%
2012-12 1% 9% 41% 43% 0% 2% 4% 1%
2013-01 1% 9% 48% 36% 0% 3% 3% 1%
2013-02 1% 8% 37% 47% 0% 1% 4% 1%
2013-03 2% 8% 37% 41% 0% 9% 3% 0%
2013-04 1% 7% 39% 48% 0% 1% 2% 1%
System A Operating Systems
  Windows XP Windows 7 Windows 8 Others
2012-08 97% 2% 0% 0%
2012-09 97% 2% 0% 1%
2012-10 97% 2% 0% 1%
2012-11 96% 0% 3% 1%
2012-12 93% 3% 2% 1%
2013-01 89% 2% 8% 1%
2013-02 84% 7% 7% 1%
2013-03 82% 12% 4% 1%
2013-04 85% 10% 3% 1%
System B Operating Systems
  Windows XP Windows 7 Windows 8 Windows 2003 Server iOS Others
2012-08 52% 38% 0% 4% 5% 1%
2012-09 39% 56% 0% 3% 1% 1%
2012-10 29% 66% 1% 3% 0% 1%
2012-11 14% 81% 2% 2% 0% 0%
2012-12 26% 69% 1% 4% 1% 1%
2013-01 23% 70% 3% 3% 0% 1%
2013-02 19% 73% 5% 2% 0% 1%
2013-03 20% 74% 3% 2% 0% 1%
2013-04 18% 77% 2% 2% 0% 1%

Wednesday, 1 May 2013

Hard Windows Workflow Lessons from Azure

This post is about some issues we experienced with Workflows on Windows Azure after running successfully for over a year. We effectively had the system down for several days until we got to the root of the problem. Not all of it is specific to Azure; there is information in here about certain circumstances where workflow timers won’t fire, circumstances where “warming up” a workflow service can cause your service to hang, and some scenarios where you can run into problems with SQL Azure. It’s a bit of a hodgepodge.

The system discussed here uses Windows Workflow 4.0 and .NET 4.0 and the default SQL Persistence Store.

Background

The system we are running on Azure using Windows Workflows has about 100 different workflow definitions and is fairly high traffic, with approximately 7,000 new workflow instances being started every day, all of them long-running – from hours to weeks.

After running successfully for about a year, one Friday afternoon we started seeing a lot of timeouts when we tried to start or progress workflows. We took the system down and got it back up again for the weekend. On Monday it was fine, then on Tuesday afternoon it died again; we took it offline and it was okay again for Wednesday morning. Wednesday afternoon it died once more, and we finally got to the root of the problem on Wednesday night and solved it.

The database side of the problem

We had several different factors conspiring to cause us significant problems.

  1. Over time our workflow database has grown as we (stupidly) chose to keep all history in there instead of letting the workflow engine delete instances when they completed. At the time the problems started, we had almost a million completed workflows in the database.
  2. The workflow engine does a high number of writes to the database and over time that leads to index fragmentation – and on Azure there is no easy way to re-index the database. 
  3. For one reason or another, the workflow engine had built up a backlog of several thousand workflow instances with overdue timers. This had happened slowly over a period of time, each instance probably due to one of those “connectivity issues” you are told to expect on SQL Azure. There are also scenarios where timers can be blocked from firing, which would increase the backlog – see below. A query for gauging the backlog is sketched after this list.
  4. In certain scenarios, such as bringing a new server online or rebooting a server, the workflow engine will trigger clean-up processes that will attempt to clear the backlog of overdue timers from the RunnableInstances table.
    Note that each workflow type you have will run its own clean-up routine, as they are individual services. We have over 100 different workflow definitions in the system, and when starting a server we have a warmup routine that goes through them and makes sure they are all primed (otherwise timers won’t fire) – this will also trigger the clean-up routines and is essential. Our process pauses between each workflow warmup to give the server a chance to recover, but as soon as the server is online a user may hit a given workflow definition and thus trigger the clean-up.
  5. One of the benefits of SQL Azure is that all databases are replicated three times. However, if you are doing a very large amount of writes to a database (such as when you are trying to clean up) and the database is slow or has heavily fragmented indexes, SQL Azure’s replication engine will tell you to back off and slow down – which is what happened to us when the clean-up routines were all trying to run at the same time and we had a huge backlog to clear.
  6. The workflow engine had the default connection timeout, so it would typically time out when told to back off; but, crucially, it also has retry logic and will keep trying to do its job up to 15 times.
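
Spotting the timer backlog (point 3 above) early is the main defence here. Below is a minimal monitoring sketch. It assumes the default SQL Workflow Instance Store schema under [System.Activities.DurableInstancing]; the table and column names (RunnableInstancesTable, RunnableTime) may differ between schema versions, so verify them against your own database before relying on it.

    -- Count runnable instances whose timer is already overdue, and show the
    -- oldest one. Table/column names are assumptions based on the default
    -- persistence schema - adjust to match your database.
    SELECT COUNT(*)            AS OverdueTimers,
           MIN([RunnableTime]) AS OldestTimerUtc
    FROM   [System.Activities.DurableInstancing].[RunnableInstancesTable]
    WHERE  [RunnableTime] < GETUTCDATE();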

When you combine all of these factors, essentially what you would see is a server trying to do the clean-up, which would cause a lot of writes to the database, which was already slow, which would in turn cause SQL Azure to tell the workflow engine to back off, which would then cause the user to time out and the workflow engine to keep retrying. The way this would manifest is that the system would start intermittently timing out when interacting with workflows and then, as more and more retries were being done, eventually the SQL Server would reach its connection limit of 180 concurrent connections and would block all connection attempts – which then led to more retries and the system going into complete gridlock.
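
If you want to watch the pile-up happening, a rough check of connection and request counts against the workflow database can help. This is only a sketch and assumes the sys.dm_exec_* DMVs are available to you on SQL Azure (they need VIEW DATABASE STATE permission):

    -- Rough gauge of how close the database is to its connection limit.
    SELECT (SELECT COUNT(*) FROM sys.dm_exec_connections) AS OpenConnections,
           (SELECT COUNT(*) FROM sys.dm_exec_requests)    AS ActiveRequests;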

So, on the Friday we saw an initial problem which may have been unrelated, but we then responded by adding more servers, which triggered the clean-up routine and eventually brought the system down. On Tuesday something else triggered the same chain reaction and on Wednesday morning we decided to bring another server online to deal with the level of activity from the users who were trying to catch up on work from Tuesday, again triggering the same chain reaction.

We had two instances where we took the system offline and rebooted the servers, which calmed the situation down. As the servers were coming back up they started warming up workflows and running clean-up routines, which looked good in our staging environment. However, once we switched over to production and let users back in, the users would try to catch up with their work and that would cause a cascade that eventually took the system down again as the database was being overloaded.

Things you can do to overcome the database issues

  1. Delete completed workflows from the database. At least get rid of the old ones, but also consider setting instanceCompletionAction="DeleteAll" in Web.Config.
  2. If you are able to take your system offline, the easiest way in my experience to reindex and clean up your database on Azure is to rename your workflow database and then do “create database as copy of” to create a new live copy; this copy is actually rebuilt from the data (the commands are sketched after this list). In our scenario, and to give you an idea of the difference it can make: the original database with the 1 million completed workflows was about 4GB in size. We tried to run a query to delete ~700,000 completed workflows but cancelled that query after waiting for 18 minutes. We then created a copy with “create database as copy of” and ran the same query, which completed in just over one minute – pretty dramatic by anyone’s measure.
    Incidentally, the copy was about 25% smaller before we started deleting any data. For good measure we of course did the copy trick again after cleaning up the database.
  3. If you have a backlog of pending timers in your RunnableInstances table, it is vital you let the workflow engine clear them all out. There are things that can stop pending timers from firing (see below) and if you let a backlog build up, you will eventually end up with the system going down, so put some monitoring in place (for example the overdue-timer query sketched earlier) and deal with the problem early.
  4. Increase the timeouts on the workflow database connections to reduce the risk of retries overwhelming the SQL server. Whether the timeout on the connection string is actually respected by the workflow engine I have no idea, but it can’t hurt.
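
To make items 1 and 2 concrete, here are hedged sketches of the sort of statements involved. “WorkflowStore” is a placeholder for your own database name, and the IsCompleted column is an assumption based on the default SQL Workflow Instance Store schema, so verify the names first; as the post suggests, do this while the system is offline and always keep a copy to fall back on.

    -- Item 1: delete completed instances (verify the column name and take a
    -- copy of the database before running anything like this).
    DELETE FROM [System.Activities.DurableInstancing].[InstancesTable]
    WHERE  [IsCompleted] = 1;

    -- Item 2: the rename-and-copy trick. Run these against the master
    -- database of your SQL Azure server, each as its own batch.
    ALTER DATABASE [WorkflowStore] MODIFY NAME = [WorkflowStore_old];
    CREATE DATABASE [WorkflowStore] AS COPY OF [WorkflowStore_old];

    -- The copy runs asynchronously; poll until the new database is ONLINE.
    SELECT name, state_desc FROM sys.databases WHERE name = 'WorkflowStore';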

Things that can cause workflow timers not to fire

  1. We found that we had a number of workflows in the RunnableInstances table with a timer in the past but whose related instance in the Instances table was marked as Completed. This seems to completely baffle the workflow engine; it doesn’t just ignore that one instance but seems to stop other timers from firing too. Not all, but many – it seems mildly non-deterministic. So, it is vital you monitor your RunnableInstances table for timers in the past and then investigate. If the real instance is closed, you can just delete the entry from the RunnableInstances table (see the sketch after this list), restart your server and you will see the clean-up routine start moving through the next pending timers.
  2. Not directly related to our woes above, but we also found that some instances don’t appear in the RunnableInstances table when they should. As far as we have been able to figure out, the root cause is something like this: workflow instances can be “locked” by having a SurrogateLockOwnerId assigned to them, which corresponds to an entry in the LockOwners table. One of the clean-up routines will clean up expired locks and put workflow instances back in the RunnableInstances table if, for whatever reason, the “LockOwner” did not finish with them. However, it seems that a situation can arise whereby a workflow instance in the Instances table is marked with a SurrogateLockOwnerId that does not exist in the LockOwners table. If that happens, it would appear that the instance never gets recovered and never gets back into the RunnableInstances table. Again, this is something you need to monitor for. We haven’t tested it, but it should be possible to just clear out the SurrogateLockOwnerId from the affected instances in the Instances table (see the sketch after this list) and that should fix it – but you really want to test it first.
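
Below are hedged sketches of the kind of queries the two scenarios call for. The table and column names (InstancesTable, RunnableInstancesTable, LockOwnersTable, SurrogateInstanceId, IsCompleted) are assumptions based on the default SQL Workflow Instance Store schema and may differ in your database, so check them first and, as said above, test any changes on a copy.

    -- Scenario 1: runnable entries whose underlying instance is already
    -- completed. Inspect them first, then delete the orphaned runnable rows.
    SELECT r.*
    FROM   [System.Activities.DurableInstancing].[RunnableInstancesTable] r
    JOIN   [System.Activities.DurableInstancing].[InstancesTable] i
           ON i.[SurrogateInstanceId] = r.[SurrogateInstanceId]
    WHERE  i.[IsCompleted] = 1;

    DELETE r
    FROM   [System.Activities.DurableInstancing].[RunnableInstancesTable] r
    JOIN   [System.Activities.DurableInstancing].[InstancesTable] i
           ON i.[SurrogateInstanceId] = r.[SurrogateInstanceId]
    WHERE  i.[IsCompleted] = 1;

    -- Scenario 2: instances pointing at a lock owner that no longer exists.
    -- Clearing the orphaned SurrogateLockOwnerId should let the engine
    -- recover them, but this is untested - try it on a copy first.
    UPDATE i
    SET    i.[SurrogateLockOwnerId] = NULL
    FROM   [System.Activities.DurableInstancing].[InstancesTable] i
    LEFT JOIN [System.Activities.DurableInstancing].[LockOwnersTable] lo
           ON lo.[SurrogateLockOwnerId] = i.[SurrogateLockOwnerId]
    WHERE  i.[SurrogateLockOwnerId] IS NOT NULL
       AND lo.[SurrogateLockOwnerId] IS NULL;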

In conclusion

Before all of this happened we had treated the whole workflow engine as a bit of a black box and were, frankly, a bit scared of messing with it. We also trusted it too much. Now we have learned a great deal more about the internals of the SQL Persistence Store: how all the bits hang together, how to monitor for problems and how to clear out duff data early before it can start causing problems. The key lesson is not to trust the workflow engine to be able to handle all types of problems.