Wednesday, 30 October 2013

Hitting the limits in SQL Azure

We use SQL Azure for most of our systems. For the most part, it “just works” and it is dead easy to get set up and start using. The main challenge facing most people is the need to handle retries, which is far from simple – but if you use Entity Framework then version 6 has you covered (which some caveats).

There are, however, situations where you can push SQL Azure too hard and you will have problems. In this post I will relay some of our experiences and how we went about resolving them. Once you know what you are looking for it isn’t terribly hard – but until then you will likely be tearing your hair out.

Understanding SQL Azure limits

At the time of this writing SQL Azure does not enforce limits on how hard you can hit the SQL Server – if there is capacity then you will be okay, but if the server happens to be busy then you will start seeing problems. This makes this very hard to test and plan for. However, rumour has it that Microsoft are thinking about enforcing some limits – which is actually a good thing as it will allow you to test things.

In the documentation for SQL Azure you will be told that when you are being throttled you will see a specific exception telling you this. I have never seen one of those exceptions, but I have seen plenty of timeouts. So, important point; just because you are not seeing “throttled” exceptions doesn’t mean that you are not, well, being throttled. It just means that the throttling is happening at a deeper layer and the SQL server isn’t able to bubble up the exception to your application.

The key metric you are probably used to looking at when working with SQL is how much memory and CPU you are consuming on the SQL Server. Sure enough, they are important to keep under control, but the primary bottleneck in SQL Azure right now is I/O and specifically writing to disk. In other words, the metrics you really want to look at are the reads and the writes. For the purpose of that discussion, the important term is IOPS – I/O Operations Per Second. In terms of SQL, one IOPS is one page from SQL being read or written. So, if your operation causes 100 write IOPS then it means that 100 8kb pages were written to disk. Remember that one page may have multiple records on it.

One way to reduce writes can be to have fewer indexes, which sounds counter intuitive; The idea is that you make writes cheaper as they are the primary bottleneck, in return for making reads more expensive, as reads are less of a bottleneck. What you also need to understand, however, is that some read queries actually write to the disk as well – more about this later.

Do note that Microsoft are rumoured to be be doing something about the I/O limitations at some point in the future, which may render this post irrelevant.

At the moment there are no official limits on how much you can hit SQL Azure, but if you try to keep your application under “several hundred” IOPS of either kind then you are probably ok.

Our story

We were very intermittently seeing problems with one of our bigger applications. Once in a while for 20 minutes or so the database would slow to a crawl and we would start seeing SQL Timeout exceptions being thrown by the dozen every minute. We could not figure out what was going and was looking at monitoring the network layer to see if there was a network issue and various other ideas. We spoke with Azure support who were nice and friendly but didn’t really move us very much forward. The fact that we weren’t seeing any of those famous “throttling” exceptions combined with the fact that we knew we weren’t really hitting the database all that hard lead us to believe that something in the Azure infrastructure was failing. Whether it was the network or SQL Azure or what – but we couldn’t see how it could be our fault.

Eventually we got to speak to a highly skilled SQL Azure Engineer who explained about the bottlenecks in SQL Azure and was able to show from the back-end telemetry that we were absolutely hammering the poor database, frequently exceeding 1,600 write IOPS (which is a lot in case you wonder). We were also pretty high on reads and CPU but, as was explained to us, it was the writes that was causing the real problem and was basically choking the server, leading to the timeouts.

We reviewed our audit logs and, well, we aren’t actually writing that many records and certainly not nearly enough to account for this many writes to the database so puzzlement ensued. The next step was to install some software on a worker role to get some telemetry about what was really going on (see below for more details).

It turned out that there was a specific read query that users run quite a lot that, because of the way it was written, was causing a very large temp table to be created with hundreds of thousands of records in it. Every time the query ran, it would cause 1.3 million reads and 3,400 writes. Just running it once is enough to spike the write IOPS way over the top and under certain circumstances a user was able to run it tens of times in quick succession.

The culprit turned out to be a read query that looked at a table with about 300,000 records (so not really a lot). We use Entity Framework for data access and most of the time it is spot on, but some times it can do something rather less than optimal.

The original code looked something like this:

from c in Cases.OfType<EmploymentCase>
let application = c.Applications.FirstOrDefault()

The modified code looked something like this:

from a in Applications
where a.Case is EmploymentCase

In terms of the difference between the two:

  CPU Reads Writes
Old 2,2801,319,925 3,422
New 16 7,026 0

How you can detect and fix this

This is one of those where the tooling provided by Azure is less than optimal; you can’t run SQL Profiler against your Azure SQL database – and even if you could it would probably be hard to see what was going on in a live environment.

In your local development environment you probably have the database locally on a nice, fast SSD so you probably won’t notice problem queries, even if you test on a copy of the live database. Even up in Azure, most of the time you will be fine; the bad old query above would return in about 2 seconds in most cases so didn’t raise any major concerns.

The simplest thing you can do is to fire up SQL Profiler and work through your application locally and try to monitor for queries that cause a lot of reads or writes – but it really is looking for a needle in a haystack. Still, I recommend you do it – if for no other reason than to spot excessive database chatter. Also consider using Glimpse or MiniProfiler to get information in the UI about excessive or slow database calls.

Really, though, you want to know how hard you are hitting your production database and which queries are the culprit. There is information in the database itself about queries and the cost of them, but it is hard to use this information in it’s raw form, primarily because the cost of a query is recorded as all-time cost since SQL first saw that particular query. There is a tool available on MSDN which can run on a worker role, which will read the performance every few minutes and calculate the deltas so you can actually use the information. We received our copy direct from Microsoft, but I believe the version on MSDN is the same or similar, though I have not verified this. Suffice to say, once that tool had run for 30 minutes on the live database we had found the problem and had fixed it less than an hour later. Highly recommended!

In conclusion

You and only you own the performance and availability of your application. With all the nice wrapper layers we use in the code and our nice, fast development machines it is very easy to release something to production that will sometimes overload SQL Azure and bring your application down. Own the problem and monitor your systems.

I do wish that Microsoft would make some of this SQL Telemetry available in the management portal; Even if all they do is show us the SQL CPU and IOPS usage, we can at least tell if there is a problem and take appropriate action.