Disaster Recovery using Merge

Wait, WHAT?

Does that headline sound fishy? Actually, Diff and Merge (WinMerge, to be exact) were my last resort in this disaster scenario. The final outcome could be told quite briefly, but the scenario serves as a good example of how Murphy can strike at any time… but first things first:

The Scenario

A RAID member disk walks into a bar. Bartender asks what’s wrong.
“Parity error.”
Bartender says: “Yeah, you look a bit off.”

(via @SwiftOnSecurity)

I got called on a Monday to solve a seemingly small issue with a customer’s Oracle DB: 11.2.0.4 Standard Edition on Windows, running on two servers in a standby configuration that uses custom scripts to ship the archive logs and apply them to the standby db.
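The scripts themselves aren’t shown here, but stripped to its core, such a manual standby apply in Standard Edition usually boils down to something like this (a rough sketch of the general technique, not the customer’s actual script):

-- on the standby (mounted), after the shipped archive logs have been copied into place
C:\> sqlplus / as sysdba
SQL> recover automatic standby database;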

The Sunday before, the primary server had experienced an outage of the RAID array that held the datafiles. A later analysis suggested that the RAID controller “decided” to shut down the array because of too many corrupt blocks on one of the hard disks.

The sysadmins rebooted the server early Monday morning, found the RAID array back online and ran a successful file system check. It seemed that everything was back to normal.

Except for some datafile blocks that the database reported as corrupt whenever a query touched the affected data.

Here’s an excerpt from the alert log:

Reading datafile 'E:\ORACLE\ORADATA\ORC\USERS46.DBF' for corruption at rdba: 0x0d00ba33 (file 52, block 47667)
Reread (file 52, block 47667) found same corrupt data (no logical check)
Mon Jun 03 09:51:20 2019
Corrupt Block Found
         TSN = 6, TSNAME = USERS
         RFN = 52, BLK = 47667, RDBA = 218151475
         OBJN = 140491, OBJD = 140491, OBJECT = ARTIKEL, SUBOBJECT = 
         SEGMENT OWNER = SOME_USER, SEGMENT TYPE = Table Segment
Corrupt Block Found
         TSN = 6, TSNAME = USERS
         RFN = 52, BLK = 47667, RDBA = 218151475
         OBJN = 140491, OBJD = 140491, OBJECT = ARTIKEL, SUBOBJECT = 
         SEGMENT OWNER =  SOME_USER, SEGMENT TYPE = Table Segment
Errors in file C:\ORACLE\diag\rdbms\orc_site1\orc\trace\orc_j005_2904.trc (incident=43647):
ORA-01578: ORACLE data block corrupted (file # 52, block # 47667)
ORA-01110: data file 52: 'E:\ORACLE\ORADATA\ORC\USERS46.DBF'
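As an aside: the alert log already names the affected segment, but the same mapping from file and block number to object can also be done by hand, for example:

SQL> -- map the corrupt block (file 52, block 47667) to its segment
SQL> select owner, segment_name, segment_type from dba_extents
     where file_id = 52
     and 47667 between block_id and block_id + blocks - 1;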

This of course affected the application users: their apps got an ORA-01578 reported back and the running SQL queries were cancelled. This had to be resolved.

Diagnosis

Let’s go over this quickly; as a seasoned DBA you’ll probably know how to determine the extent of the block corruption:

C:\> rman target /
...
-- Validate the whole db to check for corrupt blocks
RMAN> validate database;
...
RMAN> advise failure;

List of Database Failures
=========================

Failure ID Priority Status    Time Detected Summary
---------- -------- --------- ------------- -------
1829152   HIGH     OPEN      03.06.19        Datafile 58: 'E:\ORACLE\ORADATA\ORC\USERS55.DBF' contains one or more corrupt blocks
1829132   HIGH     OPEN      03.06.19        Datafile 54: 'E:\ORACLE\ORADATA\ORC\USERS51.DBF' contains one or more corrupt blocks
1829012   HIGH     OPEN      03.06.19        Datafile 50: 'E:\ORACLE\ORADATA\ORC\USERS48.DBF' contains one or more corrupt blocks
1828632   HIGH     OPEN      03.06.19        Datafile 56: 'E:\ORACLE\ORADATA\ORC\USERS49.DBF' contains one or more corrupt blocks
1828832   HIGH     OPEN      03.06.19        Datafile 60: 'E:\ORACLE\ORADATA\ORC\USERS53.DBF' contains one or more corrupt blocks
1828672   HIGH     OPEN      03.06.19        Datafile 53: 'E:\ORACLE\ORADATA\ORC\USERS45.DBF' contains one or more corrupt blocks
1828592   HIGH     OPEN      03.06.19        Datafile 57: 'E:\ORACLE\ORADATA\ORC\USERS56.DBF' contains one or more corrupt blocks
1828552   HIGH     OPEN      03.06.19        Datafile 52: 'E:\ORACLE\ORADATA\ORC\USERS46.DBF' contains one or more corrupt blocks

analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=128 device type=DISK
analyzing automatic repair options complete

Mandatory Manual Actions
========================
no manual actions available

Optional Manual Actions
=======================
no manual actions available

Automated Repair Options
========================
Option Repair Description
------ ------------------
1      Restore and recover datafile 58; Restore and recover datafile 52;
Restore and recover datafile 54; ...
  Strategy: The repair includes complete media recovery with no data loss
  Repair script: C:\ORACLE\diag\rdbms\orc_site1\orc\hm\reco_133923557.hm
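The list of corrupt blocks found by VALIDATE DATABASE can also be queried directly from v$database_block_corruption, which gives a compact per-file overview:

SQL> -- blocks flagged by the preceding validate run
SQL> select file#, block#, blocks, corruption_type
     from v$database_block_corruption
     order by file#, block#;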

Recovery

In Standard Edition, block-level recovery isn’t available. This means all affected datafiles have to be restored and recovered completely. The RMAN backup sets are stored on a NAS; archive logs can be found in the FRA on the primary server and as a copy on the standby server.
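(Just for contrast: Enterprise Edition offers block media recovery, which would have repaired only the damaged blocks in place, roughly like this; not an option here.)

RMAN> # Enterprise Edition only, shown for comparison:
RMAN> recover corruption list;
RMAN> # or a single block:
RMAN> recover datafile 52 block 47667;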

It looked like something I could get done quickly in the evening.

So I thought. Murphy disagreed:

  • Due to some other failures, the standby db lagged days behind the primary. So, simply switching over was not an option.
  • While working on the systems, the RAID array of the standby server crashed as well. Oh my.
  • So I went on recovering the datafiles when suddenly…
RMAN> @C:\ORACLE\diag\rdbms\orc_site1\orc\hm\reco_133923557.hm
RMAN>    # restore and recover datafile
2>    sql 'alter database datafile 50, 52, 53, 54, 56, 57, 58, 60 offline';
sql statement: alter database datafile 50, 52, 53, 54, 56, 57, 58, 60 offline
RMAN>    restore datafile 50, 52, 53, 54, 56, 57, 58, 60;
...
RMAN>    recover datafile 50, 52, 53, 54, 56, 57, 58, 60;
...
-- a number of archive logs get applied
-- when suddenly something bad happens:
...
archived log file name=E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300875_GH6BPJX1_.ARC thread=1 sequence=300875
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 06/03/2019 21:43:09
ORA-00283: recovery session canceled due to errors
RMAN-11003: failure during parse/execution of SQL statement: alter database recover logfile 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300875_GH6BPJX1_.ARC'
ORA-00283: recovery session canceled due to errors
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 77862 change 21614705355 time 06/02/2019 03:45:28
ORA-00334: archived log: 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300875_GH6BPJX1_.ARC'

Ouch. One of the archive logs is corrupt. So it seems that not only the datafiles were affected by the RAID failure, but archive logs as well.

  • Tried a copy from the standby server. That worked…
  • … for the next 8 corruptions. But then the copy on the standby server turned out to be corrupt as well, maybe due to the failure of its RAID array just minutes before, who knows.
  • Well, we have a backup, don’t we? Let’s use this!
    • Move the affected archive logs out of their directories
    • Re-catalog the archive logs so RMAN knows about the “missing” ones and looks them up in the corresponding backup set (see the sketch after this list).
    • Re-start the recovery process.
  • Turns out THERE IS NO BACKUP of the corrupted archive log! It was already corrupt BEFORE getting backed up, so RMAN had rejected it.
    (Could this have been discovered beforehand? With proper monitoring: yes, but I digress…)
  • So we’re stuck. A missing archive log in the redo stream means no recovery is possible beyond that point. And we cannot open the database normally, since the datafiles are in an inconsistent state.
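For reference, that restore-from-backup attempt boils down to something like this (a sketch; the sequence number is the one from the error above):

RMAN> # after moving the corrupt copies out of the way:
RMAN> crosscheck archivelog all;
RMAN> delete noprompt expired archivelog all;
RMAN> restore archivelog sequence 300875;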

So, what are the options from this point on?

  1. Recover ALL datafiles to a point in time before the archive log gap.
    This means losing all data from the log gap onwards, and possibly ending up with logically inconsistent data.
  2. Force open the database by setting some “_” parameter in the init.ora, then immediately rebuild the db by exporting and re-importing the data.
    This also implies data loss and possibly inconsistent data.
  3. Or, make a last-ditch attempt:

And now for something completely different!

While trying to recover from the different archive log locations, I noticed that the corruptions reported were at different positions within the archive log copies.

Let’s double-check this with the “dump logfile” command:

SQL> alter system dump logfile 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300949_GH79NP4M_.ARC' VALIDATE;

alter system dump logfile 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300949_GH79NP4M_.ARC' VALIDATE
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 20238 change 21619821386 time
06/02/2019 12:27:36
ORA-00334: archived log:
'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300949_GH79NP4M_.ARC'

-- now, check the other copy:

SQL> alter system dump logfile 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\ORC_0001_870436167_300949.ARC' VALIDATE;

alter system dump logfile 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\ORC_0001_870436167_300949.ARC' VALIDATE
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 136065 change 21619897779 time
06/02/2019 12:31:32
ORA-00334: archived log:
'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\ORC_0001_870436167_300949.ARC'

Looks interesting. Same archive log, different corruptions. What if the damaged parts in the two files were far enough apart that the correct content could be copied from one file to the other?
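To judge whether two damaged regions overlap, the block numbers from the validate errors can be turned into byte offsets (offset ≈ block# × redo block size, typically 512 bytes on this platform). The block size can be looked up in v$archived_log:

SQL> -- redo block size and length of the affected sequence
SQL> select sequence#, block_size, blocks
     from v$archived_log
     where sequence# = 300949;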

Time for the great free and open-source WinMerge to be fired up!

(Screenshot: diff of two archive log copies, with and without corruption)

With a little browsing through the files, it becomes easy to distinguish the patterns of “healthy” and corrupt content. The screenshot above shows such a difference, which can then simply be copied over to the other file.

And now the good news: in no case did the corrupt region in one file overlap with the corrupt region in the other. It was therefore possible to merge the two copies into a single, healthy archive log, which was then successfully applied to the db (as were some 9 others with similar corruptions).
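For completeness, resuming after each repaired log looked roughly like this (a sketch; if the merged file is written to a new location, it first has to be cataloged so RMAN finds it):

RMAN> # register the repaired copy (only needed if it is not at the original location)
RMAN> catalog archivelog 'E:\ORACLE\FRA\ORC_SITE1\ARCHIVELOG\2019_06_02\O1_MF_1_300875_GH6BPJX1_.ARC';
RMAN> # ... and continue the recovery
RMAN> recover datafile 50, 52, 53, 54, 56, 57, 58, 60;
RMAN> # once recovery completes:
RMAN> sql 'alter database datafile 50, 52, 53, 54, 56, 57, 58, 60 online';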

Lessons Learned

I had all fingers and toes crossed and consumed too much coffee during that process… and was rewarded with a happy ending!

Yet, there were some insights to be gained:

  • Even the totally unexpected can happen – in this case a failure on 3 levels:
    • RAID array failure and file corruption on the primary server
    • RAID array failure and corruption of the very same files on the standby server
    • A gap in the backup due to already corrupt files.
  • The outdated standby db could have been discovered earlier with proper monitoring.
  • The gap in the backup could also have been discovered earlier with proper monitoring (a sketch of such a check follows below).
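A simple periodic check would have caught both issues early; a minimal sketch (scheduling and thresholds depend on the environment):

SQL> -- archived logs that never made it into any backup set
SQL> select sequence#, name from v$archived_log
     where backup_count = 0 and deleted = 'NO';

SQL> -- run on primary and standby and compare, to spot an apply lag
SQL> select max(sequence#) as last_sequence from v$log_history;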

While you usually secure your system against one failure at a time, these other problems could have been known about before getting stuck in the middle of the recovery process. The double archive log corruption couldn’t have been avoided, though. But this leads to the most important takeaway:

Never give up!
