Av rating:
Total votes: 16
Total comments: 4


Alex Kuznetsov
The Case of the Skewed Totals
15 April 2008

Even when your code tests out perfectly  in the standard test cell, you can experience errors in the real production setting where several processes are hitting the database at once, in unpredictable ways.  You shouldn’t, of course, let it get that far, because there are now ways of simulating concurrency during the test process.

Concurrency in databases may cause subtle, difficult to explain, and difficult to reproduce problems. For example, what will happen when several processes are writing to and reading from the database at the same time, in unpredictable ways? Will your reports still be completely correct? Not necessarily.

Believe it or not, but when you simultaneously modify data and run aggregate queries against the same data, your aggregates may intermittently be slightly incorrect. This phenomenon  may go unnoticed for some time, and even when it is noticed, it is difficult to reproduce – in fact such problems are impossible to reproduce without simulating concurrency. This is a very common real life problem. In this article I will reproduce this phenomenon and demonstrate how the accuracy of the information returned by  your SELECT statements may depend on the isolation level.

Prerequisites.

We shall be reusing the StressTests database and StressTester class which were described in the previous article ‘Close Those Loopholes: Stress-Test those Stored Procedures’ . Make sure that read committed snapshot isolation is turned off by executing this code::

ALTER DATABASE StressTests

    SET READ_COMMITTED_SNAPSHOT OFF

Also you will need a few additional tables and stored procedures which are provided in StressTest_MoreDDL.sql file.

Reproducing a select returning incorrect totals.

Before stress testing, your test harness will generate some test data – exactly 10000 users and exactly 10000 tickets assigned to them. During stress testing one thread will be randomly re-assigning tickets to other users, without adding or removing any tickets whatsoever:

    internal class TicketsShuffler : IStressTestable

    {

        private SqlCommand _command;

        private int errors = 0;

 

        public void Prepare(SqlConnection connection)

        {

            _command = connection.CreateCommand();

            _command.CommandText = "Writers.ReassignTicket";

            _command.CommandType = CommandType.StoredProcedure;

            _command.Parameters.AddWithValue("@TicketId", 0);

            _command.Parameters.AddWithValue("@AssignedTo", 0);

        }

 

        public void Run()

        {

            DateTime startedAt = DateTime.Now;

            Random random = new Random();

            const int numIterations = 8000;

            for (int i = 0; i < numIterations; i++)

            {

                try

                {

                    _command.Parameters[0].Value = Convert.ToInt32(random.Next(1, 10000));

                    _command.Parameters[1].Value = Convert.ToInt32(random.Next(1, 10000));

                    _command.ExecuteNonQuery();

                }

                catch (Exception e)

                {

                    Console.WriteLine(e);

                    errors++;

                }

                if (errors > 10)

                {

                    break;

                }

            }

            Console.WriteLine(

                string.Format("TicketsShuffler: {0} completed, {1} failed, duration {2}", numIterations - errors, errors,

                              DateTime.Now.Subtract(startedAt)));

        }

    }

 At the same time another thread will be counting all the tickets in the system:

    internal class TotalsInLoopReader : IStressTestable

    {

        private SqlCommand _command;

        private readonly int[] numbers = new int[2000];

        private int currentIndex = 0;

        private int errors = 0;

 

        public void Prepare(SqlConnection connection)

        {

            _command = connection.CreateCommand();

            _command.CommandText = "Readers.SelectTotalTickets";

            _command.CommandType = CommandType.StoredProcedure;

        }

 

        public void Run()

        {

            DateTime startedAt = DateTime.Now;

            for (int i = 0; i < numbers.Length; i++)

            {

                try

                {

                    numbers[currentIndex++] = Convert.ToInt32(_command.ExecuteScalar());

                }

                catch (Exception e)

                {

                    Console.WriteLine(e);

                    errors++;

                }

                if (errors > 10)

                {

                    break;

                }

            }

            Console.WriteLine(

                string.Format("TotalsInLoopReader: {0} completed, {1} failed, duration {2}", currentIndex, errors,

                              DateTime.Now.Subtract(startedAt)));

            int incorrectTotalsCounter = 0;

            for (int i = 0; i < currentIndex; i++)

            {

                if (numbers[i] != 10000)

                {

                    Console.WriteLine(string.Format("TotalsInLoopReader Iteration {0}, incorrect totals :{1}", i, numbers[i]));

                    incorrectTotalsCounter++;

                }

            }

            Console.WriteLine(

                string.Format("TotalsInLoopReader: {0} incorrect totals out of {1}", incorrectTotalsCounter,

                              currentIndex

                    ));

        }

    }

The tables that store users and tickets and the stored procedure which counts the tickets are as follows:

CREATE TABLE Data.Users(UserId INT NOT NULL,

 FirstName VARCHAR(8) NOT NULL,

 LastName VARCHAR(8) NOT NULL,

 Position VARCHAR(20) NOT NULL,

 LotsOfComments CHAR(5000) NOT NULL,

 CONSTRAINT PK_Users PRIMARY KEY(UserId))

GO

CREATE TABLE Data.Tickets(TicketId INT NOT NULL,

  AssignedTo INT NOT NULL,

  Priority VARCHAR(10) NOT NULL,

  Status VARCHAR(10) NOT NULL,

  LotsOfComments CHAR(5000) NOT NULL,

  CONSTRAINT PK_Tickets PRIMARY KEY(TicketId),

  CONSTRAINT FK_Tickets_Users FOREIGN KEY(AssignedTo) REFERENCES Data.Users(UserId)

)

GO

CREATE INDEX Tickets_AssignedTo ON Data.Tickets(AssignedTo)

GO

CREATE PROCEDURE Readers.SelectTotalTickets

AS

SELECT COUNT(*) AS cnt

  FROM Data.Tickets

GO

Note that it uses the default READ COMMITTED isolation level. Although there are exactly 10000 tickets in the system at any time, the totals in many cases will be slightly different. In fact, when I run the test, the totals are usually incorrect in about 50% cases :

TotalsInLoopReader: 1098 incorrect totals out of 2000

The full test harness is provided in TotalsTest.cs.

As you have seen, the default READ COMMITTED isolation level does not guarantee correct totals, but why is that? Is it some kind of bug in SQL Server? In fact, no, this is not a bug, SQL Server works as designed. The explanation of incorrect totals is very simple: as the select scans the index on AssignedTo column, it does not keep locks on index pages after it has finished reading them. So, if a ticket is reassigned, the corresponding index entry can be moved to another index page. As such, it can be either read twice or not read at all. Note that in this case the totals can both slightly exceed or be slightly less than 10000.

Trying out REPEATABLE READ isolation level.

Let us up the isolation level of the selects to REPEATABLE READ:

ALTER PROCEDURE Readers.SelectTotalTickets

AS

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

SELECT COUNT(*) AS cnt

  FROM Data.Tickets

Rerun the stress test. This time the number of incorrect totals has decreased and in this case all incorrect totals are less than 10000. There are no incorrect totals exceeding 10000 this time. The reason is simple: to ensure that the reads are repeatable, this isolation level may prevent an index entry from moving to another index page. This means that a reassigned ticket will not be counted twice. However, this isolation level does not prevent index entries that were not counted yet from moving to an already scanned page.

Trying out SERIALIZABLE isolation level.

Let us up the isolation level to SERIALIZABLE:

ALTER PROCEDURE Readers.SelectTotalTickets

AS

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

SELECT COUNT(*) AS cnt

  FROM Data.Tickets

Rerun the stress test. This time all the totals are always correct, but there are so many deadlocks (on my laptop up to 50% of selects are deadlock victims) that the price for 100% correctness of results is in many cases prohibitive.

Using snapshot isolation.

Let us make sure that snapshot isolation is enabled, as follows:

ALTER DATABASE StressTests

    SET ALLOW_SNAPSHOT_ISOLATION ON

When your selects run under snapshot isolation level, as follows:

ALTER PROCEDURE Readers.SelectTotalTickets

AS

SET TRANSACTION ISOLATION LEVEL SNAPSHOT

SELECT COUNT(*) AS cnt

  FROM Data.Tickets

GO

These selects and modifications do not block each other. As a result, all your selects always return correct results, and there are no deadlocks involving a reader and a writer. Also because there is no lock waiting, the stress test completes significantly faster. Rerun the stress test and see for yourself.

Using dirty reads aka READ UNCOMMITTED isolation level.

Run the following SQL script and rerun the stress test:

ALTER PROCEDURE Readers.SelectTotalTickets

AS

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

SELECT COUNT(*) AS cnt

  FROM Data.Tickets

Note that your selects run faster and you are getting less incorrect totals than when you were using the default READ COMMITTED isolation level. The explanation is straightforward: selects running under READ UNCOMMITTED isolation level acquire less locks. As such, they run faster, so the chance of an index entry counted twice or not counted at all is smaller.

Conclusion

Solid understanding of concurrency is essential in database programming. I think that experimenting with concurrently running queries is a very good way to acquire such understanding. In this article I have demonstrated why selects may work perfectly without concurrency yet intermittently fail in real life multi user environments.

As you have seen, in this particular case the best choice is clear: the query running under snapshot isolation always returns correct results, and it never fails due to deadlocks. Although the best choice in your particular situation may be different, the choice of isolation level may be very important. Do not make this choice lightly – make sure that you verify your choice against a realistic amount of data, and perform enough realistic concurrency tests.



This article has been viewed 1845 times.
Alex Kuznetsov

Author profile: Alex Kuznetsov

Alex Kuznetsov has been working with databases for more than a decade. He is a SQL Server MVP. Currently he leads database development and optimizes database performance for a trading firm.

Search for other articles by Alex Kuznetsov

Rate this article:   Avg rating: from a total of 16 votes.


Poor

OK

Good

Great

Must read
 
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Snapshot isolation sounds good, but not a panacea
Posted by: McAir (not signed in)
Posted on: Wednesday, April 16, 2008 at 7:37 AM
Message: Snapshot isolation is implemented using row-versioning (in tempdb). SQL2005 supposedly has increased performance across tempdb operations, but the fact remains: snapshot makes extensive use of tempdb, so instead of lock-pressure when running in default isolation mode, I would anticipate seeing pressure on tempdb in a high-throughput environment.

Subject: snapshot isolation and performance
Posted by: Eric Russell (view profile)
Posted on: Wednesday, April 16, 2008 at 1:50 PM
Message: Regarding the usage of tempdb by snapshot isolation, I have a couple of thoughts.
First, you should avoid reporting (or even bulk aggregate totaling) on OLTP databases with heavy insert / update / delete traffic. This should be done on a designated ODS server or perhaps on a "database snapshot" (different from "snapshot isolation"). However, many decision support type applications that operate off of real time data may not have this option available.
Second, there are methods for improving the performance of the tempdb database like placing it's data file on a seperate dedicated I/O subsystem that is optimzied for writes and preallocating the size of the file to prevent autogrow operations.

Subject: snapshot isolation and performance
Posted by: Ian Ringrose (not signed in)
Posted on: Thursday, April 17, 2008 at 11:44 AM
Message: If you have enough RAM (think 64 bit) then the tempdb will never hit disk, as it does not have to be committed to stable storage as it is wiped clean on each reboot. RAM is cheep these days, so provided the “reporting” queries are limited in number and running time, sizing the server should be able to solve the problem of tempdb IO.

However using snapshot isolation does slow down all data changes (include when the changes are done from a different isolation type), so do benchmark the effect it has on your OLTP load.

Subject: Clearly snapshot isolation is not free
Posted by: Alex Kuznetsov (view profile)
Posted on: Saturday, April 19, 2008 at 6:05 PM
Message: Clearly snapshot isolation is not free - it does have a price tag, and it may be considered expensive. Unfortunately, in my benchmarking the only other solution which never returns incorrect results is significantly more expensive. Serializable isolation level requires some serious locking and as such it uses up a lot of CPU.
Of course, if the business tolerated slightly incorrect results, that would be an entirely different situation.

However, my main point is not a blanket recommendation to use snapshot isolation in all the cases no matter what. My point is quite different: the choice of isolation level is very important - make this choice based on benchmarking, and be aware of the downstream ramifications of your choice.

 









Phil Factor
Bunnikins!
 When an IT manager is selected as a victim of office politics of a large corporate, it is time for him to engage in... Read more...



 View the blog
On the Trail of the Expanding Databases
 It is sometimes difficult for other IT people to understand the constraints that DBAs have to work... Read more...

Using Powershell to Generate Table-Creation Scripts
 For all of us who learn best by trying out examples, Bob Sheldon produces a PowerShell script file for... Read more...

SQL Toolbelt 2008: Predominantly an Engineering Task
 The conversion of the Red-Gate tools to be compatible with SQL Server 2008 might not seem, on first... Read more...

Audit Crosschecks
 In this short article, the second of a 2-part series, William suggests a solution, using SQL Data... Read more...

XML Jumpstart Workbench
 In which Robyn and Phil decide that the best way of starting to learn XML is to jump in and take a ride... Read more...

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 Database design and implementation is the cornerstone of any data centric project (read 99.9% of... Read more...

SQL Server Full Text Search Language Features
 SQL Full-text Search (SQL FTS) is an optional component of SQL Server 7 and later, which allows fast... Read more...

Beginning SQL Server 2005 Reporting Services Part 2
 Continuing his in-depth tour of SQL Server 2005 Reporting Services, Steve Joubert demonstrates the most... Read more...

Executing SSIS Packages
 Nigel Rivett demonstrates how to execute all SSIS packages in a given folder using either an SSIS... Read more...

Over 150,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.

Join Simple Talk