There may be times when you need to delete a large amount of data and there isn’t the comfort of a large maintenance window, or you simply can’t wait for some downtime in production. Issuing a basic DELETE FROM could potentially cause contention and drastically reduce concurrency for the duration of the delete. With just a few rows of data this is typically no big deal, and the end users will be none the wiser. But with a large amount of data to be deleted, it could have a significant impact on users and on performance in general. One solution is to delete the data in batches. As a set-based proponent, this is one of those times where we can really benefit from a loop.
I want to create some test data to do our experimenting on (I know this goes without saying, but only run a variation of this code in production after extensive testing in your particular environment). I’m using a database called TestDB, so feel free to create this database or use your own test database (the script is not dependent on any existing objects, so it can be a newly created DB).
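If you need to create the test database first, something simple like the following will do (just a sketch; the database name is whatever you want to use, and default file settings are assumed, so adjust for your environment):
if db_id(N'TestDB') is null
    create database TestDB
go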
use TestDB
go
-- drop the table only if it already exists, so this runs cleanly on a new database
if object_id(N'dbo.LargeDeleteTest', N'U') is not null
    drop table dbo.LargeDeleteTest
go
create table LargeDeleteTest
(
id int identity(1, 1) not null
primary key clustered,
SomeData nvarchar(128) not null
default replicate('a', 128)
)
go
insert into LargeDeleteTest
values(default)
go 1250
So now we have a test table (LargeDeleteTest) with 1250 rows of data in it. I’m oversimplifying this scenario, as in the grand scheme of things 1250 rows is negligible. But as we go through this example, apply this thought process to millions upon millions of rows that need to be deleted.
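If you want a quick sanity check before moving on, a simple row count confirms we are starting with 1250 rows:
select count(*) as row_count
from LargeDeleteTest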
Let’s demonstrate what it’ll look like if you don’t delete data in batches. Execute the code below and notice the result of the query against the sys.dm_tran_locks DMV:
begin tran
delete
from LargeDeleteTest
--commit tran
select
resource_database_id,
resource_associated_entity_id,
request_session_id,
request_mode,
request_type,
request_status,
resource_type,
count(*) as resource_count
from sys.dm_tran_locks
group by
resource_database_id,
resource_associated_entity_id,
request_session_id,
request_mode,
request_type,
request_status,
resource_type
We get an expected 1250 exclusive key locks on the clustered index of our LargeDeleteTest table. That is 100% of the table. In other words, until this DELETE transaction is committed, there is an exclusive lock on all of that data. (Note: run the commented-out COMMIT TRAN to commit that transaction so it doesn’t hinder our experimentation further down this post.)
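Also keep in mind that on a truly large table, a single DELETE wouldn’t necessarily hold millions of individual key locks; once a statement accumulates roughly 5,000 locks, SQL Server may escalate to a single exclusive table (OBJECT) lock, which is even worse for concurrency. A rough way to check for that (from another session) is to look for object-level exclusive locks in the same DMV:
select
    request_session_id,
    resource_associated_entity_id,
    request_mode,
    request_status
from sys.dm_tran_locks
where resource_type = 'OBJECT'
    and request_mode = 'X'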
Here’s some code to show you how
to delete data in batches to skirt around the above concurrency nightmare:
declare
    @batch_size int,
    @del_rowcount int = 1

set @batch_size = 100

set nocount on;

while @del_rowcount > 0
begin
    begin tran

    delete top (@batch_size)
    from dbo.LargeDeleteTest

    set @del_rowcount = @@rowcount

    print 'Delete row count: '
        + cast(@del_rowcount as nvarchar(32))

    commit tran
end
The code is very simple. All it does is loop through and delete the data in batches of 100 (I chose this batch size to illustrate the point on a smaller scale, but as explained above, you should tune this variable for your specific environment). To continue with the lock demonstration from above, execute the code below, which is basically just one iteration of the WHILE loop:
insert into LargeDeleteTest
values(default)
go 1250
begin tran
delete top (100)
from LargeDeleteTest
--commit tran
select
resource_database_id,
resource_associated_entity_id,
request_session_id,
request_mode,
request_type,
request_status,
resource_type,
count(*) as resource_count
from sys.dm_tran_locks
group by
resource_database_id,
resource_associated_entity_id,
request_session_id,
request_mode,
request_type,
request_status,
resource_type
Now in the result set of querying sys.dm_tran_locks we see that we only have 100 exclusive key locks. Concurrency is greatly improved by deleting the data in batches, because the whole clustered index no longer has an exclusive lock on every key, just on the keys in the current batch.
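As a variation on the loop above (just a sketch, so treat the predicate as hypothetical and tune the batch size and delay for your environment), you can add a WHERE clause if only a subset of rows should go, and a short WAITFOR DELAY between batches to give other sessions some breathing room:
declare
    @batch_size int = 100,
    @del_rowcount int = 1

set nocount on;

while @del_rowcount > 0
begin
    begin tran

    delete top (@batch_size)
    from dbo.LargeDeleteTest
    where SomeData = replicate('a', 128)    -- hypothetical predicate

    set @del_rowcount = @@rowcount

    commit tran

    -- brief pause between batches so other sessions can get their locks in
    waitfor delay '00:00:01'
end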
As we’ve seen here, there are going to be times when we need to delete a large amount of data, and it may not be the best idea to just do it in a single DELETE FROM statement. Through the use of a loop and batched deletes, we can minimize the impact the operation has on the accessibility of our underlying data.
The batch delete code can be found in my SQLSalt repository on GitHub here: DML_Delete_DeleteLargeDataByBatches.sql. If there are any comments, questions, or issues, please feel free to leave a comment below or email me at sqlsalt@gmail.com.