Wednesday, April 25, 2012

Data Corruption Part 2: Clustered Index Corruption


                In the last post (Part 1) I discussed how to approach a nonclustered index data corruption issue.  What I am going to be explaining here is what to do with data corruption of the clustered index.  I want to start out with a copy of my AdventureWorks2012 database:

restore database AdventureWorks2012_CorruptCI
from disk = 'C:\MyDir\adventureworks2012_base.bak'
with
       move 'adventureworks2012_data' to 'C:\MyDir\AW_Data_CorruptCI.mdf',
       move 'adventureworks2012_log' to 'C:\MyDir\AW_Log_CorruptCI.ldf',
       stats = 5
go

                Now I want to corrupt a clustered index.  Last week in Part 1 I went into a long explanation and demonstration how to corrupt data using a Hex Editor.  I’m not going to rehash, as the process is basically the same.  The only difference between last week’s nonclustered index and this week’s clustered index corruption is that I want to choose the clustered index of a table, and I have personally chose the Person.Person table (you’ll see why soon.  Hint:  constraints).

                So corrupt a clustered index, any clustered index.  Feel free to work along with me and corrupt the Person.Person clustered index.  Now set the database back online and run DBCC CHECKDB:

alter database AdventureWorks2012_CorruptCI
set online
go

dbcc checkdb('AdventureWorks2012_CorruptCI') with no_infomsgs
go

                Notice what you see.  Very much like last week’s result, we get a handful of errors.  Notice the summary of the DBCC command:

CHECKDB found 0 allocation errors and 4 consistency errors in table 'Person.Person' (object ID 1765581328).
CHECKDB found 0 allocation errors and 4 consistency errors in database 'AdventureWorks2012_CorruptCI'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (AdventureWorks2012_CorruptCI).

                I have bolded the above portion because I believe it is worth discussing.  SQL Server actually plants the thought in the DBA’s head to use REPAIR_ALLOW_DATA_LOSS with DBCC CHECKDB as the “minimum”.  Now, one may be thinking “well this is the minimum, and SQL Server said it is okay then I think I’ll go right ahead.  It must be the right answer.”  Not always.  And we’ll see why now.  Let’s say you did run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS:

alter database AdventureWorks2012_CorruptCI
set single_user
with rollback immediate
go

dbcc checkdb('AdventureWorks2012_CorruptCI', repair_allow_data_loss)
go

                Great!  It finishes, and all the issues were fixed.  So you grab your car keys, call your significant other, and tell them you’re on the way home.  Easy workday, no late dinner tonight!  Or so you thought.  Meanwhile, you’re grinning from ear to ear because you’ve fixed it… but there is something that was missed:  Something very important.  And then you receive the dreaded phone call.  Here’s what was missed.  Take for instance my Person.Person table that I corrupted.  Even after the REPAIR_ALLOW_DATA_LOSS normal queries like this are successful:

select *
from Person.Person

                So what could possibly be the problem?  Here’s what:  REPAIR_ALLOW_DATA_LOSS does just that… it may possibly delete data in order to fix the corruption.  But there is still yet another looming issue here.  The answer is in this next DBCC statement:

use AdventureWorks2012_CorruptCI
go

dbcc checkconstraints
go

                On my end that returned a whopping 25 foreign key relationships that were flagged as culprits:


                What looks like happened is that during my DBCC CHECKDB with the option to REPAIR_ALLOW_DATA_LOSS, SQL Server deleted five rows of data.  Well, because there were foreign key constraints referencing Person.Person, that has now become a pretty big problem (not to mention the underlying fact that we have just deleted data).

                The solution?  There is a good chance that when the root data, in this case the clustered index, is corrupt you will need to utilize a non-corrupt backup that you can restore side-by-side with the corrupt database and copy data over.  This is yet another reason why a great disaster recovery plan is necessary.  Things happen, data gets corrupted.  And there’s not always a quick and easy command, statement, or query to get it back to normal.  Do the appropriate investigation as to what exactly is corrupt, and the best means to fix it.

                The code for this test can be found on my GitHub SQLSalt repository.  A direct link to this test code is here.  NOTE:  THIS CODE AND TEST IS MEANT FOR ISOLATED TEST ENVIRONMENTS ONLY.  DO NOT RUN ANYWHERE NEART PRODUCTION.  If there are any comments, questions, or issues please feel free to leave a comment or email me at sqlsalt@gmail.com.

1 comment:

  1. Beaware that REPAIR_ALLOW_DATA_LOSS can get you in more complicate corruption scenarios, so it should be considerate like the last resource. If you are able to identify the corrupted pages, you can always restore that page from backups, and avoid the REPAIR_ALLOW_DATA_LOSS, if the issue is located outside the primary Filegroup and you are running on enterprise version of SQL server, this can be done fully online :)
    Regards
    Kenneth

    ReplyDelete