Tuesday, May 8, 2012

Delete a Large Amount of Data in Batches


There may be times when you need to delete a large amount of data and you don't have the comfort of a large maintenance window, or you simply can't wait for some down time in production.  Issuing a basic DELETE FROM could cause contention and drastically reduce concurrency for the duration of the delete.  With just a few rows of data this is typically no big deal, and the end users will be none the wiser.  But with a large amount of data to be deleted, it can have a real impact on the users, and on performance in general.  One solution is to delete the data in batches.  As much of a set-based proponent as I am, this is one of those times when we can really benefit from a loop.

I want to create some test data to experiment on (I know this goes without saying, but only run a variation of this code in production after extensive testing in your particular environment).  I'm using a database called TestDB, so feel free to create this database or use your own test database (the script doesn't depend on any existing objects, so a newly created DB works fine).

use TestDB
go

if object_id('dbo.LargeDeleteTest') is not null
       drop table LargeDeleteTest
go

create table LargeDeleteTest
(
       id int identity(1, 1) not null
              primary key clustered,
       SomeData nvarchar(128) not null
              default replicate('a', 128)
)
go

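-- note: go <n> is an SSMS/sqlcmd feature that repeats the preceding
-- batch n times, so this inserts 1250 rows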
insert into LargeDeleteTest
values(default)
go 1250
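
As a quick sanity check, a simple count confirms the load:

select count(*) as row_count
from LargeDeleteTest
go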

So now we have a test table (LargeDeleteTest) with 1250 rows of data in it.  I'm over-simplifying this scenario; in the grand scheme of things, 1250 rows is negligible.  But as we go through this example, apply the same thought process to millions upon millions of rows that need to be deleted.

Let's first demonstrate what it looks like if you don't delete the data in batches.  Execute the code below and notice the result of the sys.dm_tran_locks DMV query:

begin tran
       delete
       from LargeDeleteTest
--commit tran

select
       resource_database_id,
       resource_associated_entity_id,
       request_session_id,
       request_mode,
       request_type,
       request_status,
       resource_type,
       count(*) as resource_count
from sys.dm_tran_locks
group by
       resource_database_id,
       resource_associated_entity_id,
       request_session_id,
       request_mode,
       request_type,
       request_status,
       resource_type

We get an expected 1250 exclusive key locks on the clustered index of our LargeDeleteTest table; that is 100% of the table.  In other words, until this DELETE transaction is committed, there is an exclusive lock on every one of those rows.  (Note: run the commented-out COMMIT TRAN to commit the transaction so it doesn't hinder our experimentation further down this post.)
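
Before you run that COMMIT TRAN, you can see the contention firsthand.  From a second session, try a simple read of the table; assuming the default READ COMMITTED isolation level (and that read committed snapshot isolation has not been enabled on the database), it will block until the delete transaction commits or rolls back:

-- run this from a *second* session while the delete transaction is open;
-- under default read committed this will block on the exclusive key locks
select top (1) *
from LargeDeleteTest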

Here's some code to show you how to delete data in batches to skirt around the above concurrency nightmare:

declare
       @batch_size int,
       @del_rowcount int = 1

set @batch_size = 100

set nocount on;

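-- @del_rowcount is initialized to 1 so the loop body runs at least once;
-- keep looping until a delete affects zero rows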
while @del_rowcount > 0
       begin
              begin tran
                     delete top (@batch_size)
                     from dbo.LargeDeleteTest

                     set @del_rowcount = @@rowcount
                     print 'Delete row count: ' + cast(@del_rowcount as nvarchar(32))
              commit tran
       end

The code is very simple: all it does is loop through and delete the data in batches of 100 (I chose a small batch size to illustrate the point on a smaller scale, but as explained above you should tune this variable for your specific environment).  To continue the lock demonstration from above, execute the code below, which is basically just one iteration of the WHILE loop (after repopulating the table):

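-- repopulate the test table before the single-batch demonstration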
insert into LargeDeleteTest
values(default)
go 1250

begin tran
       delete top (100)
       from LargeDeleteTest
--commit tran

select
       resource_database_id,
       resource_associated_entity_id,
       request_session_id,
       request_mode,
       request_type,
       request_status,
       resource_type,
       count(*) as resource_count
from sys.dm_tran_locks
group by
       resource_database_id,
       resource_associated_entity_id,
       request_session_id,
       request_mode,
       request_type,
       request_status,
       resource_type

Now, in the result set of the sys.dm_tran_locks query, we see that we only have 100 exclusive key locks.  Concurrency has improved because we're deleting the data in batches: the whole clustered index no longer has an exclusive lock on every key, just on the keys in the current batch.

As we've seen here, there will be times when we need to delete a large amount of data, and it may not be the best idea to do it in a single DELETE FROM statement.  Through the use of a loop and batched deletes, we can minimize the impact this will have on our underlying data's accessibility.

The batch delete code can be found in my GitHub SQLSalt repository here:  DML_Delete_DeleteLargeDataByBatches.sql.  If there are any comments, questions, or issues, please feel free to leave a comment below or email me at sqlsalt@gmail.com.
